ABSTRACT


A class of grammars, called terminal context grammars, is defined. Terminal
context grammars have context-sensitive productions which are context-free
productions with identical strings of terminals at the end of the left and right
sides. Properties of these grammars are investigated and a method for parsing
a subclass of them is presented. A major use for terminal context grammars
is to generate them from context-free grammars and then parse
according to the result. It is shown that this provides a practical method of
parsing grammars which are LR(k), with k greater than one allowed. Several
methods of producing terminal context grammars from context-free ones are
given, with each one embodying different trade-offs between parsing table size
and error handling ability.
1. INTRODUCTION


The use of formal grammars to describe the syntax of programming languages has
gained wide acceptance. Among the advantages of this practice is the existence of
parser generator programs which can produce parsers for a language from its grammar.
This means that a parser that is guaranteed to agree with the grammatical
characterization of a language can quickly be generated.

The applications of this technique are not limited to computer programs: the input
to many computer programs often has a structure that is easily described by a grammar.
It has been suggested that many data processing programs can be represented
by a “program” which is essentially a grammar with some augmentations [Silverberg
78]. However, the parsing of computer programs as the first stage of a compiler is the
prime example of where the technique is used.

1.1  The  Problem 

LR(k) parsers, introduced by Knuth [Knuth 65], are the most powerful of the ones using
the conventional parsing model. That model has a stack, an input queue, and a finite
state control. Parsing is done deterministically in one left-to-right pass over the input
and only a finite amount of information is used to make each parsing decision. While
LR(k) grammars are not perfect for describing even the surface syntax of programming
languages, they are quite good.

A problem is that LR(k) parsers take up a lot of space, both in generation and in
their final tabular form. Because of this, the LALR(k) modification of the LR(k) method
[DeRemer 69] is the one recommended for use in modern compilers. In fact, even
LALR(k) parsers may be too large if k>1. Thus the most general LR method that seems
to be in production use is the LALR(1) method.

It is not difficult to write a LALR(1) grammar which generates a given programming
language, so most people live happily with such a parser generator. But in the course
of writing a LALR(1) grammar, it is sometimes necessary to represent the structure of
the language in a manner which is less than satisfactory. The way that the grammar
for a language is built up out of various subgrammars is important because of the
processing that follows. The actions taken by the processing program take their cue
from the various phrases analysed according to the grammar.

For example, to make a LALR(1) grammar it is sometimes necessary to have several
strings generated by the same nonterminal in spite of the fact that they must be
processed completely differently. This often means that a general “holding” data
structure must be used to accumulate data for later processing, when the particular class of
the input has been decided. At other times special schemes are put into the scanner,
such as introducing extra lookahead or using outside information to differentiate
between lexically identical tokens.

Ideally, it should be unnecessary to suit the grammar to the LALR(1) technique in
this manner. One should be able to write a grammar to be easily understood and easily
adaptable to later processing of the parsed representation of the input. In other
words, the grammar should be “natural”. Of course, it is well-known that any language
generated by an LR(k) grammar can be generated by an LR(1) grammar [Knuth 65], so
theoretically there would seem to be no point in providing a method which allows k>1.
However, later examples will show that sometimes an LR(1) grammar is awkward to use
compared to an LR(2) grammar for the same language.

1.2  Thesis  Objectives 

This thesis will investigate a method of parsing grammars which are LR(k), with k
possibly greater than one, without the size blowup that occurs with Knuth’s parser
generator. “Natural” grammars for programming languages seem to require the full power of
LR(k) in only a few localized instances. The method of this thesis takes advantage of
this to automatically limit the places where size increases occur.

The method involves generating a Terminal Context Grammar from a context-free
grammar by adding identical strings of terminals to both the left and right sides of
certain productions. A motivation for this is as follows: by using k symbols of lookahead,
an LR(k) parser can be viewed as parsing a context-sensitive grammar wherein the
lookahead strings are at the end of the left and right sides. But the grammar that is
being parsed has context of length k on every production. What this thesis proposes is
that context is only needed on some of the productions. By explicitly representing the
context-sensitive part in the grammar it becomes possible to achieve such a
differentiation in the amount of lookahead used.


It  might  also  be  possible  to  approach  the  same  problem  by  looking  at  modifications 
to  the  LR(k)  machine  and  parser  generator.  The  terminal  context  approach  was  used 
for  several  reasons: 

• It seems to expose the underlying mechanism more clearly. This is an advantage
when it comes to various proofs. Also, some of the theory can be
adapted from the work of Turnbull on deterministic type-0 grammars
[Turnbull 75], [Turnbull and Lee 79].

• There may be occasions where it is advantageous to use a terminal context
grammar which generates a different language from the underlying
context-free grammar.

1.3  Previous  Work 

One way of getting the full power of LR(k) for a grammar without
the large size increase is the method of Korenjak [Korenjak 69]. This partitions the
grammar into a number of subgrammars; an LR(k) parser is built for each and the
parsers are joined together. If the correct partitioning scheme is chosen then the full
power of LR(k) can be brought to bear where it is needed. The problems are that there
is no obvious scheme for choosing the partitioning (Korenjak calls it “somewhat of a
trial and error process” [Korenjak 69, p. 621]), the full LR(k) parsers for the
subgrammars may be larger than necessary, and the method is still not practical for k>1.

Another method that is often mentioned as a solution when the LALR(1) method fails
but the LR(k) method succeeds is the state splitting technique [DeRemer 69], [Aho and
Ullman 72]. The idea is that the difference between LALR and LR is that the former
merges together states that are separate in the latter, and sometimes this causes a
loss in the ability to tell precisely what symbols can come next. If the appropriate
states of a LALR machine are split, then the resulting parser will act locally like the LR
parser. Pager described a “lane tracing” algorithm for deciding which states are to be
split [Pager 1973]. It splits the starting states of productions which end in
nondeterministic states, until the machine is deterministic or no more splitting can be done.
State splitting after the machine has been built may not be the best way to approach
the problem because two separate splits may yield states that can be merged, but this
is hard to recognize with the somewhat heuristic approach to state splitting that has
been given in the literature. In contrast, the methods of this thesis will develop split
states as a by-product of the grammar, so that well-established parser generation
theory can be used to decide when states may be merged.

As for introducing k>1 in localized places in the parser, there is the fairly obvious
method of calculating SLR or LALR sets for k>1 in only those places where k=1 did not
work, and encoding the parser so that the greater lookahead mechanism is invoked
only when it is needed. Pager describes this for the SLR case in [Pager 73]; nobody
seems to have published a practical method for calculating LALR lookahead sets for
k>1. A problem with this method is that table encoding and compacting algorithms
have all been developed under the assumption that only one symbol of lookahead is
needed to make a parsing decision. It would be nice to find a method for having k>1
which can use the existing table encoding techniques.

The work of this thesis is in many ways an in-depth look at a subclass of the grammars
and parsers investigated by Turnbull [Turnbull 75]. He generalized the LR(k)
parser model to allow the front of the input queue to be used as a stack, on which
arbitrary strings of terminals and nonterminals could be pushed, while retaining the single
left-to-right deterministic pass over the input. Some of his results regarding the grammars
that can be parsed with this model apply to terminal context grammars.

Other work on context-sensitive grammars includes that of [Walters 70] and [Révész
71]. Neither of these seems to be as practical a base to work from as Turnbull's, for
reasons mentioned in his thesis [Turnbull 75, pp. 1-10, 1-11].

1.4  Thesis  Overview 

After a review of notation and definitions in chapter 2, the concept of terminal context
grammars is introduced in chapter 3. The motivation for them is given and an
investigation into the languages generated by them is made. Also given is the parser model
used, which is taken from [Turnbull 75]. It turns out that terminal context grammars
are theoretically no more powerful than context-free grammars, but some languages
may still be easier to describe using terminal context grammars.

The heart of the thesis is chapter 4. It begins with the description of a parser
generator which is an improvement of Turnbull’s parser generator for terminal context
grammars. Only a subset of terminal context grammars can be parsed using the
parser model, and this is characterized. Then various ways of turning context-free
grammars into terminal context grammars are described. The relationship between
their parsers and the LR(k) parsers for the corresponding context-free grammars is
investigated, and it is shown that none of the power of LR(k) is lost.

Chapter 5 is a discussion of the practicalities of the techniques described in the
thesis. It is shown that the methods can be used with only a small space penalty and
no time penalty compared to a LALR(1) parser for the equivalent grammar. If the
equivalent grammar is not LALR(1) then the penalty is of no consequence because of
the gain in expressibility. A parser generator was implemented to collect statistics for
this chapter, and to show its feasibility.

The final chapter evaluates the utility of the methods developed. In particular,
more examples are given to show how LR(k) for k>1 is sometimes quite useful. Some
conclusions are given, as well as some suggestions for future work.
2. DEFINITIONS


This  chapter  defines  the  notation  and  terminology  used  in  the  rest  of  the  thesis.  Most 
of  the  concepts  should  be  familiar  to  the  reader,  so  they  are  given  in  a  concise  set  of 
definitions. 

2.1 Languages


Definition. An alphabet is a finite set of symbols. A string (over an alphabet,
$T$) is a sequence $t_1 t_2 \ldots t_n$ where each $t_i$ is a member of $T$ and $n \ge 0$.

The empty string will be represented by the symbol $\lambda$. The length of a
string $\alpha$ will be represented by $|\alpha|$. A language (over an alphabet, $T$) is a
possibly-infinite set of strings over $T$. □

The  following  functions  are  useful  for  working  with  strings. 

Definition. The $\mathrm{FIRST}_k$ function chops off its argument string after $k$
characters if it is that long. I.e., $\mathrm{FIRST}_k(x) = x$ if $|x| \le k$; otherwise it
equals $u$ such that $x = uv$ and $|u| = k$. The functionality of $\mathrm{FIRST}_k$ is
extended to sets of strings in the obvious manner.

The $\mathrm{PREFIX}_k$ function is defined by

$\mathrm{PREFIX}_k(x) = \{\, y \mid x = yz \text{ for some string } z,\ |y| \le k \,\}$

In the above equation, $y$ may be the empty string or $x$ itself. Again, the
functionality of $\mathrm{PREFIX}_k$ is extended to sets of strings. Sometimes the
subscript $k$ will be omitted from '$\mathrm{PREFIX}_k$', in which case there is no
length restriction on the result set elements. □
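
As an illustration only, the two string functions can be sketched in Python. The sketch assumes that strings over the alphabet are ordinary Python strings with one character per symbol; the function names `first_k`, `prefix_k` and `first_k_of_set` are hypothetical and are not part of the thesis.

```python
def first_k(x, k):
    """FIRST_k: x itself if |x| <= k, otherwise the length-k head of x."""
    return x[:k]


def prefix_k(x, k=None):
    """PREFIX_k: every y with x = yz and |y| <= k.

    With k omitted there is no length restriction on the result.
    """
    limit = len(x) if k is None else min(k, len(x))
    return {x[:i] for i in range(limit + 1)}


def first_k_of_set(language, k):
    """FIRST_k extended to a (finite) set of strings."""
    return {first_k(x, k) for x in language}
```

For instance, `prefix_k("abc", 2)` contains the empty string, `"a"` and `"ab"`, but not `"abc"`.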

There are numerous ways of describing languages in terms of simpler languages. As
well as the normal set operations (such as union, $\cup$), the following notations are
convenient.

Definition. The product of two languages $L_1$ and $L_2$ is the set of words
formed by concatenating any word from $L_1$ to any word from $L_2$. It is
written as $L_1 \cdot L_2$ or just $L_1 L_2$ if the meaning is clear.

The powers of a language are defined by $L^0 = \{\lambda\}$ and $L^n = L \cdot L^{n-1}$ when
$n \ge 1$.

The reflexive transitive closure and the transitive closure of a language
$L$ are given by (respectively)

$L^* = L^0 \cup L^1 \cup L^2 \cup \ldots$

$L^+ = L^1 \cup L^2 \cup L^3 \cup \ldots$

□

Often a single symbol will be used as a “language” so that the above notations apply
to single symbols too. For example, $a^* = \{\lambda, a, aa, aaa, \ldots\}$.
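
The language operations above can be tried out on small finite languages. The following Python sketch is illustrative only: it assumes languages are finite sets of Python strings with the empty string `""` standing for $\lambda$, and because the closures are infinite unions it only computes them up to a chosen power `n`. The names `product`, `power` and `closure_up_to` are hypothetical.

```python
def product(l1, l2):
    """L1 . L2: every word of L1 followed by every word of L2."""
    return {u + v for u in l1 for v in l2}


def power(language, n):
    """L^0 = {lambda}; L^n = L . L^(n-1) for n >= 1."""
    result = {""}
    for _ in range(n):
        result = product(language, result)
    return result


def closure_up_to(language, n, reflexive=True):
    """Finite part of L* (or of L+ when reflexive is False): powers up to n."""
    first_power = 0 if reflexive else 1
    words = set()
    for i in range(first_power, n + 1):
        words |= power(language, i)
    return words
```

With the single symbol `a` used as a language, `closure_up_to({"a"}, 3)` yields the first four elements of $a^*$.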

2.2  Grammars 

More complicated languages can be described by means of a phrase structure grammar
(simply called grammar in this thesis).

Definition. A grammar is a 4-tuple, $\langle N, T, S, P \rangle$ where

• $N$ is a finite set of nonterminal symbols.

• $T$ is a finite set of terminal symbols ($N \cap T = \emptyset$). Sometimes $V$ is used
as a short form for $N \cup T$.

• $S$ is a distinguished member of $N$, the start symbol.

• $P$ is a finite set of productions of the form $u \to v$ where
$u \in (N \cup T)^+$, $v \in (N \cup T)^*$. $u$ is called the left hand side (lhs) and it must
contain at least one member of $N$. $v$ is called the right hand side.

□

The  productions  of  a  grammar  are  used  as  rewrite  rules  to  generate  strings  in  T*  as 
described  in  the  following  definition. 

Definition. Let $G = \langle N, T, S, P \rangle$. A string $w'$ is immediately generated
from a string $w$ if and only if

$w = sut$, $w' = svt$, and $u \to v \in P$, $s, t \in V^*$

This is written

$w \Rightarrow_G w'$ (or $w \Rightarrow w'$ if $G$ is understood)

A string $w'$ is generated from a string $w$ if either $w' = w$ or there are
$w_0, \ldots, w_n$ such that $w = w_0$, $w_n = w'$ and $w_i$ immediately generates $w_{i+1}$
for each $i$, $0 \le i < n$. This is written

$w \Rightarrow_G^* w'$ (or $w \Rightarrow^* w'$ if $G$ is understood)

The word derivation is often used to describe the sequence of rewrites
used to generate one string from another.

Sometimes the concept of canonical derivation is important. This is a
sequence of rewriting steps such that at no time is a rule applied to a part
of the string wholly to the right of the previously replaced string. If the
derivation is

$w_1 = s_1 u_1 t_1 \Rightarrow s_1 v_1 t_1 = s_2 u_2 t_2 \Rightarrow \ldots \Rightarrow s_n v_n t_n = w_n$

then it is a canonical derivation if and only if $s_{i+1} u_{i+1} \in \mathrm{PREFIX}(s_i v_i)$ for all
$i$, $1 \le i \le n-1$. This is written as

$w_1 \overset{c}{\Rightarrow}{}_G^{*}\ w_n$ (or $w_1 \overset{c}{\Rightarrow}{}^{*}\ w_n$ if $G$ is understood)

A sentential form is a string $w \in V^*$ such that $S \Rightarrow^* w$. A sentence is a
sentential form in $T^*$.

The language generated by a grammar $G$ is
$L(G) = \{\, w \mid w \in T^*,\ S \Rightarrow_G^* w \,\}$

□ 
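
The generation relation can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not a construction from the thesis: symbols are single characters, productions are `(u, v)` pairs of strings, and the search for sentences assumes that no production shortens the string (so that forms longer than `max_len` can be discarded safely). The names `immediately_generated`, `sentences_up_to` and `ANBN` are hypothetical.

```python
def immediately_generated(productions, w):
    """All w' such that w = s u t and w' = s v t for some production u -> v."""
    results = set()
    for u, v in productions:
        start = w.find(u)
        while start != -1:
            results.add(w[:start] + v + w[start + len(u):])
            start = w.find(u, start + 1)
    return results


def sentences_up_to(productions, start_symbol, terminals, max_len):
    """Sentences of L(G) of length <= max_len.

    Assumes |u| <= |v| for every production, so pruning long
    sentential forms cannot lose any short sentence.
    """
    seen = {start_symbol}
    frontier = [start_symbol]
    found = set()
    while frontier:
        w = frontier.pop()
        if all(symbol in terminals for symbol in w):
            found.add(w)  # a sentential form in T* is a sentence
        for successor in immediately_generated(productions, w):
            if len(successor) <= max_len and successor not in seen:
                seen.add(successor)
                frontier.append(successor)
    return found


# Example grammar: S -> aSb | ab
ANBN = [("S", "aSb"), ("S", "ab")]
```

Calling `sentences_up_to(ANBN, "S", {"a", "b"}, 6)` enumerates the sentences of the example grammar with at most six symbols.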

Chomsky  classified  grammars  into  four  types  depending  on  the  form  of  productions 
allowed.  These  classes  are: 

• A type-0 grammar (TOG) has no restrictions on the form of productions (other
than those mentioned in the definition).

• A context-sensitive grammar (CSG) is one in which all productions $u \to v$ are
such that $|u| \le |v|$.

• A context-free grammar (CFG) is one in which all productions $u \to v$ are such
that $u$ is always a single nonterminal, and $v$ is nonempty.

• A regular grammar is one in which all productions are of the form
$A \to Bx$, $A, B \in N$, $x \in T^*$ or $A \to y$, $A \in N$, $y \in T^+$. (Alternatively, all productions can
be of the form $A \to xB$ or $A \to y$).

The empty string, $\lambda$, is usually allowed in the last three types of grammars by
permitting a single production $S \to \lambda$ when $S$ is the goal symbol and there are no productions
with $S$ on the right hand side.

One  can  also  talk  about  type-0,  context-sensitive,  context-free,  and  regular 
languages,  which  are  languages  that  can  be  generated  by  some  grammar  of  the 
corresponding  type. 
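
The four classes can be checked mechanically for a concrete grammar. The Python sketch below is only an illustration: it assumes single-character symbols with upper-case letters as nonterminals, ignores the special $S \to \lambda$ allowance, and reports the most restrictive class that fits. All function names are hypothetical.

```python
def is_nonterminal(symbol):
    return symbol.isupper()


def all_terminals(s):
    return all(not is_nonterminal(c) for c in s)


def left_linear(u, v):
    """A -> Bx (x in T*) or A -> y (y in T+)."""
    if len(u) != 1 or not is_nonterminal(u) or len(v) == 0:
        return False
    return all_terminals(v) or (is_nonterminal(v[0]) and all_terminals(v[1:]))


def right_linear(u, v):
    """A -> xB (x in T*) or A -> y (y in T+)."""
    if len(u) != 1 or not is_nonterminal(u) or len(v) == 0:
        return False
    return all_terminals(v) or (is_nonterminal(v[-1]) and all_terminals(v[:-1]))


def grammar_type(productions):
    """Most restrictive Chomsky class satisfied by every production."""
    if (all(left_linear(u, v) for u, v in productions)
            or all(right_linear(u, v) for u, v in productions)):
        return "regular"
    if all(len(u) == 1 and is_nonterminal(u) and len(v) > 0
           for u, v in productions):
        return "context-free"
    if all(len(u) <= len(v) for u, v in productions):
        return "context-sensitive"
    return "type-0"
```

For example, the productions `S -> aSb`, `S -> ab` are classified as context-free, while adding a rule such as `AB -> a` forces the type-0 class.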

2.3  Textual  Conventions 

The  following  textual  conventions  will  hold  throughout: 

• Large Roman letters (A, B, C, ...) are used for nonterminals. Also, sometimes
capitalized words are used for nonterminals (e.g., Statement).

• Small Roman letters near the beginning of the alphabet (a, b, c, ...) are used for
terminals. Strings enclosed in single quotes may also be used for terminals
(e.g., ‘begin’).

• Small Roman letters near the end of the alphabet (..., x, y, z) are used for
strings of terminals.

• Small Greek letters ($\alpha$, $\beta$, $\gamma$, ...) are used for strings of terminals and
nonterminals.

Often grammars will be presented simply by listing their productions. It will be
understood that the left hand side of the first production is the goal symbol. The terminal
and nonterminal sets can be found by the above conventions or by their usage in the
productions.


3.  TERMINAL  CONTEXT  GRAMMARS 


This chapter will define and discuss the properties of a class of grammars which is a
subclass of the context-sensitive grammars. The grammars of the subclass are
restricted to have only strings of terminals as context and only context at the right end.

Definition. A terminal context grammar (TCG) is a 4-tuple $\langle N, T, S, P \rangle$
where

$N$ is the set of nonterminals
$T$ is the set of terminals ($V = N \cup T$)
$S$ is the start symbol
$P$ is the set of productions, of the form $Ax \to \alpha x$
($x \in T^*$, $A \in N$, $\alpha \in V^*$). The string $x$ in such a production
will be called the terminal context string.

A $k$-TCG is a TCG where the terminal context strings are all of length $k$ or
less (i.e., $|x| \le k$ in the above definition). A full $k$-TCG is a $k$-TCG in which
every production has a terminal context of length equal to $k$. □
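
To make the definition concrete, the following Python sketch checks whether a list of productions has the TCG shape $Ax \to \alpha x$ and, if so, finds the smallest $k$ for which it is a $k$-TCG. It is an illustration only, assuming single-character symbols and productions given as `(lhs, rhs)` string pairs; the names `terminal_context`, `tcg_order`, `is_full` and the example grammar are hypothetical.

```python
def terminal_context(lhs, rhs, nonterminals):
    """Return the context string x if lhs -> rhs reads A x -> alpha x, else None."""
    if not lhs or lhs[0] not in nonterminals:
        return None
    x = lhs[1:]
    if any(symbol in nonterminals for symbol in x):
        return None  # context must consist of terminals only
    if not rhs.endswith(x):
        return None  # the same context must close the right side
    return x


def tcg_order(productions, nonterminals):
    """Smallest k such that the grammar is a k-TCG, or None if it is not a TCG."""
    contexts = [terminal_context(l, r, nonterminals) for l, r in productions]
    if any(x is None for x in contexts):
        return None
    return max((len(x) for x in contexts), default=0)


def is_full(productions, nonterminals, k):
    """True if every production has a terminal context of length exactly k."""
    contexts = [terminal_context(l, r, nonterminals) for l, r in productions]
    return all(x is not None and len(x) == k for x in contexts)


# Z -> Ab | Bc,  Ab -> ab,  Bc -> ac
EXAMPLE_N = {"Z", "A", "B"}
EXAMPLE_P = [("Z", "Ab"), ("Z", "Bc"), ("Ab", "ab"), ("Bc", "ac")]
```

The example grammar is a 1-TCG but not a full 1-TCG, since its first two productions carry no context.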

3.1  Power  of  TCGs 

One  might  hope  that  the  terminal  context  would  allow  the  generation  of  languages  that 
are  not  context-free.  Unfortunately,  the  TCGs  generate  exactly  the  context-free 
languages.  This  will  be  shown  by  appealing  to  the  relationship  between  push-down 
automatons  and  context-free  languages. 

Definition. A push-down automaton (PDA) is a device which manipulates
configurations according to certain rules. A configuration is a string over
$U^* Q T^*$, where $U$, $Q$, $T$ are finite sets: $U$ is the stack alphabet, $Q$ is a set of
states, and $T$ is a set of terminals. A configuration $\gamma\, q\, y$ means
that the finite state control of the PDA is in state $q$, with $\gamma$ being the
current stack contents (top of stack at the right) and $y$ being the unused
portion of the input string.

The rules of the PDA are of one of the two forms:

i) $A\, q_i \vdash \alpha\, q_j$, $A \in U$, $\alpha \in U^*$, $q_i, q_j \in Q$

ii) $A\, q_i\, a \vdash \alpha\, q_j$, $A \in U$, $\alpha \in U^*$, $q_i, q_j \in Q$, $a \in T$

A rule of type (i) uses the stack top and the current state as context to
allow a move like

$\beta A\, q_i\, y \vdash \beta\alpha\, q_j\, y$

in which the state changes from $q_i$ to $q_j$, the $A$ is popped off the stack, and
$\alpha$ is pushed on instead. A rule of type (ii) uses the stack top, the current
state, and the next input symbol as context to allow a move like

$\beta A\, q_i\, ay \vdash \beta\alpha\, q_j\, y$

as in rule (i), except that a terminal is read too.

The initial configuration is of the form $Z_s\, q_s\, w$, where $Z_s$ and $q_s$ are
some specified initial stack symbol and initial state, respectively, and $w$ is
the input string. A string is accepted if $Z_s\, q_s\, w \vdash^* \alpha\, q$ for some $q \in F$, a set
of final states. □
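
The move relation of such a PDA can be explored with a small breadth-first search over configurations. The Python sketch below is an illustration under assumptions that are not in the thesis: a rule is encoded as a tuple `(A, q_i, a, alpha, q_j)` with `a` set to `None` for rules of type (i), the search is cut off after `max_configs` configurations to guard against unbounded stack growth, and the example rule set for $\{a^n b^n \mid n \ge 1\}$ is hypothetical.

```python
from collections import deque


def pda_accepts(rules, start_stack, start_state, finals, word, max_configs=10000):
    """Search the configurations (stack, state, input position) reachable by moves."""
    start = ((start_stack,), start_state, 0)
    seen = {start}
    queue = deque([start])
    while queue and len(seen) <= max_configs:
        stack, state, pos = queue.popleft()
        if state in finals and pos == len(word):
            return True  # reached "alpha q" with q final and the input used up
        if not stack:
            continue  # every rule needs a stack top to pop
        for top, q_i, symbol, pushed, q_j in rules:
            if top != stack[-1] or q_i != state:
                continue
            if symbol is None:
                new_pos = pos  # type (i): no input read
            elif pos < len(word) and word[pos] == symbol:
                new_pos = pos + 1  # type (ii): read one terminal
            else:
                continue
            successor = (stack[:-1] + tuple(pushed), q_j, new_pos)
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return False


# Hypothetical rules accepting a^n b^n (n >= 1); top of stack is at the right.
ANBN_RULES = [
    ("Z", "p", "a", ("Z", "X"), "p"),
    ("X", "p", "a", ("X", "X"), "p"),
    ("X", "p", "b", (), "r"),
    ("X", "r", "b", (), "r"),
    ("Z", "r", None, (), "f"),
]
```

A call such as `pda_accepts(ANBN_RULES, "Z", "p", {"f"}, "aabb")` follows every applicable rule from the initial configuration, which mirrors the nondeterminism of the machine.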

Notice  that  a  PDA  is  a  nondeterministic  machine.  This  is  essentially  the  definition 
given  in  [Aho  and  Ullman  72],  with  some  slight  differences  in  notation.  They  then  define 
an  extended  PDA  which  allows  moves  to  be  made  on  the  basis  of  finite  strings  of  top-of- 
stack  context  rather  than  single  symbols,  and  show  that  for  every  extended  PDA  there 
is  a  PDA  which  accepts  the  same  language.  Here,  an  extended  PDA  will  be  defined  to 
allow  strings  of  head-of-input  context  as  well. 

Definition. An extended PDA is the same as a PDA, except that the rules
for moves are of the type:

i) $\beta\, q_i\, x \vdash \alpha\, q_j\, x$, $\alpha, \beta \in U^*$, $q_i, q_j \in Q$, $x \in T^*$

ii) $\beta\, q_i\, ax \vdash \alpha\, q_j\, x$, $\alpha, \beta \in U^*$, $q_i, q_j \in Q$, $x \in T^*$, $a \in T$

In the above rules, call $\beta$ the top-of-stack context and call $x$ (or $ax$ in rule
(ii)) the head-of-input context. □
Informally, one can see that the machine still treats the stack portion of its string
as  a  stack,  and  still  doesn’t  use  the  input  portion  of  the  string  for  any  storage,  so  that 
one  would  expect  that  it  would  have  exactly  the  same  power  as  a  PDA.  This  will  now  be 
shown  more  formally. 

Lemma. Given an extended PDA, $P$, there is a PDA, $P'$, which accepts
$L(P)\bot^k$, where $\bot$ is not in $P$'s terminal set and $k$ is defined below.

Proof. Let $m$ be the maximum length of the top-of-stack context used in a
rule of $P$. Let $k$ be the maximum length of the head-of-input context used
in a rule of $P$. The PDA $P'$ will store the top $m$ characters of the stack and
the next $k$ characters of the input in its finite state control. Thus the
states of $P'$ will be

$Q' = \{\, [\alpha\, q\, x] \mid q \in Q,\ \alpha \in U^*,\ x \in T^*,\ |\alpha| \le m,\ |x| \le k \,\}$

The rules of $P'$ are derived from the rules of $P$ as follows:

a) For each rule $\beta\, q_i\, x \vdash \alpha\, q_j\, z$, where $z$ is either $x$ or $x'$ such that
$ax' = x$, do:

i) If $|\beta| \le |\alpha|$ add these rules:

$[\gamma\beta\, q_i\, xy] \vdash \theta\, [\delta\, q_j\, zy']$ for all $y, \gamma, \delta, \theta$ such that
$\gamma\beta \in U^m$, $xy \in T^k$, $\delta, \theta \in U^*$, $\theta\delta = \gamma\alpha$, $|\delta| = m$,
$|zy'| = k$, $y' \in \mathrm{PREFIX}(y)$.

($|\gamma\alpha| = m - |\beta| + |\alpha| \ge m$, so this is possible.)

ii) If $|\beta| > |\alpha|$ add these rules:

$A\, [\gamma\beta\, q_i\, xy] \vdash [A\gamma\alpha\, q_j\, zy']$ for all $A, \gamma, y$ such that
$\gamma\beta \in U^m$, $xy \in T^k$, $A \in U$, $|zy'| = k$, $y' \in \mathrm{PREFIX}(y)$.

b) To get to a state where $m$ stack symbols are known from situations
like a)ii), the following rules must be added:

$A\, [\beta\, q\, x] \vdash [A\beta\, q\, x]$

for all $A \in U$, $q \in Q$, $x \in T^k$, $|\beta| < m$

c) To get to a state where $k$ input symbols are known from situations
where $z$ is a proper suffix of $x$, and during startup, the following
rules must be added:

$[\beta\, q\, x]\, a \vdash [\beta\, q\, xa]$

for all $\beta \in U^m$, $A \in U$, $q \in Q$, $a \in T$, $x \in T^*$, $|x| < m$

The PDA will start with the configuration $[Z'^{m-1} Z_s\, q_s\,]\, w \bot^k$ where $w$ is the
string that would be given as input to $P$.

Now, any configuration $\alpha\, q\, y$ of $P$ will have equivalent configurations
$\alpha_1\, [\alpha_2\, q\, y_1]\, y_2$ in $P'$, with $Z'^{m-1}\alpha = \alpha_1\alpha_2$ and $y\bot^k = y_1 y_2$. Whenever
$|\alpha_2| = m$ and $|y_1| = k$, the only moves that $P'$ can make are those
corresponding to moves that $P$ can make, and it can make all of those.
Otherwise, the only moves that $P'$ can make are ones which keep the state
in the same equivalence class, but increase the length of either $\alpha_2$ or $y_1$.
Therefore, the PDA $P'$ accepts the same language as the extended PDA $P$
except for the extra $\bot$'s needed for the former. □

The  above  definitions  have  been  given  so  that  the  following  well-known  theorem  can 
be  used.  It  was  proven  in  [Chomsky  62]. 

Theorem. A language $L$ can be generated by a context-free grammar if
and only if it is accepted by some PDA. □

This theorem, together with the preceding lemma, shows that extended PDA’s also accept
exactly the context-free languages. Every context-free grammar is itself a TCG whose
terminal context strings are all empty, so it remains to show that k-TCGs generate
only context-free languages. This is done by showing that k-TCGs can be recognized
by extended PDA’s.

Theorem. Given a $k$-TCG, $G$, which has been augmented with $k$ $\bot$'s, there
exists an extended PDA, $P$, which accepts $L(G)$.

Proof. $P$ will do a nondeterministic bottom-up parse of a string in $L(G)$.
Its rules will allow any shifts onto the stack and also any reduction of a
production whose right side is on the top of the stack. In the case of
productions with added right context, the context will just be looked at while
it is still at the front of the input.

The rules of $P$ are as follows:

a) $q\, a \vdash a\, q$ for all $a \in T$ ($T$ is the terminal set of both $G$ and $P$).

b) $\alpha\, q \vdash A\, q$ where $A \to \alpha$ is a production of $G$.

c) $\alpha\, q\, x \vdash A\, q\, x$ where $Ax \to \alpha x$ is a production of $G$.

d) $\# Z\, q \vdash q_f$ where $Z$ is the goal symbol of $G$.

The starting stack symbol is #, the starting state is $q$ and the final state is
$q_f$.

Now it will be shown that:

$\#\, q\, w \vdash^* q_f \iff Z \Rightarrow^* w$

$\Rightarrow$ If $P$ reaches state $q_f$, it must have gone through a series of
moves like:

$\#\, q\, w = \theta_1 \vdash \theta_2 \vdash \cdots \vdash \theta_n = q_f$

If the initial # and the internal $q$ are removed, the sequence
$\theta_{n-1}, \ldots, \theta_2, \theta_1$ is a rightmost derivation in $G$ (though some of
the adjacent $\theta$'s may be identical after the modification). To see
this, note that:

a) In going from $\theta_i$ to $\theta_{i-1}$, either the $q$ moves left, leaving
the underlying sentential form the same, or the nonterminal
immediately to the left of the $q$ is replaced by a string.
But this latter action takes place only when the
corresponding derivation step is allowed in $G$. The lookahead
rules ensure that added right context productions
are used only in the correct context.

b) Because the $q$ either stays at the same spot or moves left,
the derivation must be rightmost.

This shows the $\Rightarrow$ direction.

$\Leftarrow$ If $w \in L(G)$, there is a derivation:

$Z \Rightarrow \beta_1 A_1 x_1 \Rightarrow \beta_2 A_2 x_2 \Rightarrow \cdots \Rightarrow w$

where $x_i \in T^*$, $x_i$ is a suffix of $x_{i+1}$, and $A_i$ is the nonterminal on
the left side of the rule used - this does not exclude added right
context rules.

The above derivation can be used to construct a valid
sequence of moves for $P$. Starting with the configuration $\#\, q\, w$,
get the next configuration by reading the above derivation
backwards. If $x_i$ is a proper suffix of $x_{i+1}$, insert the appropriate
number of shift moves to make the input string look like $x_i$.
Then $\alpha_{i+1}$ will be on the top of the stack, and the reduce move to
$A_i$ can be made. A correspondence between sentential forms
and machine configurations is thus set up, and an inductive
argument can be used to complete the proof.

□
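
The machine built in this proof can be simulated directly. The Python sketch below is an illustration under assumptions that go beyond the text: symbols are single characters, a production $Ax \to \alpha x$ is encoded as a hypothetical triple `(A, alpha, x)`, right-hand sides are assumed non-empty so that the search space is finite, and the $\bot$ augmentation is omitted because the context is simply compared with the remaining input. A configuration is the pair (stack, input position), and acceptance corresponds to rule (d) firing once the input is exhausted.

```python
from collections import deque


def tcg_accepts(productions, goal, word):
    """Nondeterministic shift/reduce recognition following rules (a) to (d)."""
    word = tuple(word)
    start = ((), 0)
    seen = {start}
    queue = deque([start])
    while queue:
        stack, pos = queue.popleft()
        if stack == (goal,) and pos == len(word):
            return True  # rule (d): only the goal symbol is left
        successors = []
        if pos < len(word):
            # rule (a): shift the next terminal onto the stack
            successors.append((stack + (word[pos],), pos + 1))
        for lhs, alpha, context in productions:
            alpha, context = tuple(alpha), tuple(context)
            assert alpha, "sketch assumes non-empty right-hand sides"
            on_stack = stack[-len(alpha):] == alpha
            in_input = word[pos:pos + len(context)] == context
            if on_stack and in_input:
                # rules (b)/(c): reduce alpha to lhs; the context stays in the input
                successors.append((stack[:-len(alpha)] + (lhs,), pos))
        for cfg in successors:
            if cfg not in seen:
                seen.add(cfg)
                queue.append(cfg)
    return False


# Z -> Ab | Bc,  Ab -> ab,  Bc -> ac   (generates {ab, ac})
EXAMPLE_TCG = [
    ("Z", "Ab", ""),
    ("Z", "Bc", ""),
    ("A", "a", "b"),
    ("B", "a", "c"),
]
```

On the input `ab`, the reduction of `a` to `A` is permitted only because the next input symbol is `b`; the competing reduction to `B` is blocked by its context `c`.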

Thus it has been shown that k-TCGs generate exactly the context-free languages.
The context in productions does not increase the generative capabilities of the
resulting grammars. In fact, it seems certain that the same holds for a mixture of terminal
left context and terminal right context - it would appear that it is the presence of
nonterminals in the context which gives context-sensitive grammars their increased
power. This statement is just made as an interesting conjecture; it will not be
discussed any further because it is outside the scope of this thesis.

3.2  Parsing  TCGs 

The previous section has shown that the advantage of TCGs is not an increase in
generative power. The advantage of TCGs lies in the properties of parsers for a subclass of
them. This will be shown by relating them to the DRP grammars introduced by
Turnbull [Turnbull 75].

Turnbull’s work generalized LR(k) parsers by identifying and retaining their “good”
features, while expanding the scope of grammars handled to include some type-0 and
context-sensitive grammars. The “desirable parser characteristics” he listed are:

• halting in all cases and correctly identifying the input as to whether it is valid
or not

• time efficient (with practical considerations): a single left to right scan of the
input should be made, and the control should be deterministic without
backtracking

• with the “error property”: no parsing action other than “looking” should be
performed on any symbol past the point where an error can be detected in a
left to right scan.

These  are  certainly  desirable  properties  for  an  application  such  as  a  compiler.  A 
parser  which  has  all  these  properties  is  a  D2SM  with  the  error  property.  Informally,  it 
is  an  LR(k)  parser  with  the  added  ability  to  reduce  a  right  hand  side  to  more  than  one 
symbol. 

Definition [Turnbull 75]. A Deterministic 2-Stack Machine (D2SM) is a
device which manipulates configurations according to certain rules. A
configuration is a string over $V_L^* Q V_R^*$ where $V_L$, $Q$, $V_R$ are finite sets. $V_L$ is
the L-stack alphabet, $Q$ is a set of states, and $V_R$ is the R-stack alphabet.
Elements of $V_L$ are all of the form $(\gamma, q)$ where $\gamma \in V_R \cup \{\lambda\}$, $q \in Q$. A
configuration is supposed to represent the total state of the parser shown
in figure 3.1.


Figure 3.1. Two Stack Machine
The string to parse is initially pushed onto the R-stack with its first
symbol at the left (which is the top of the R-stack). As parsing proceeds, symbols
may be “read” or “reduces” may occur. In the latter case, nonterminals
will be pushed on the R-stack so that in general the R-stack contains
a string of $(N \cup T)^*$. The term “apply” is synonymous with “reduce”.
The L-stack contains a string of composite symbols made up of $V_R$ symbols
and state numbers (actually, only the state numbers are necessary).
A configuration $\psi\, q\, \alpha$ means that the finite control is in state $q$, the
R-stack contains $\alpha$ (top of stack at the left) and the L-stack contains $\psi$ (top
of stack at the right).

For  a  2-stack  machine,  the  rules  must  be  of  one  of  the  following  forms. 

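i) read: $q\, \gamma\alpha \vdash (\gamma, q)\, q'\, \alpha$, $\gamma \in V_R \cup \{\lambda\}$, $\alpha \in V_R^*$, $q, q' \in Q$

ii) lookahead: $q\, \alpha \vdash q'\, \alpha$, $\alpha \in V_R^*$, $q, q' \in Q$

iii) reduce: $(\gamma, q)\psi\, q' \vdash q\, \sigma$, $(\gamma, q) \in V_L$, $\sigma \in V_R^*$, $q, q' \in Q$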

There are some restrictions on the rules. For a D2SM these are:

• All states must be accessed by a unique symbol, thus allowing
the stacking of only the state on the L-stack

• It must be possible to tell which move must be made by examining
only the current state and a bounded number of symbols on
the top of the R-stack

• Reads of the empty string must go to a state which immediately
reduces that empty string to something

• Look-ahead moves must be to states containing only reduce
rules

Finally, there are some technical requirements which ensure that there
are no useless states or rules.

• For any state $q$, there exist $\psi \in V_L^*$ and $\alpha \in V_R^*$ such that
$\psi\, q\, \alpha \vdash^* \psi\delta\, q'\, \beta$ and there is a reduce rule applicable to $\psi\delta\, q'\, \beta$

• For any reduce rule $(\gamma_1, q_1) \cdots (\gamma_n, q_n)\, q \vdash q_1\, \alpha$, the appropriate
read rules exist so that it may be used. I.e.,

$\psi\, q_1\, \gamma_1 \cdots \gamma_n \beta \vdash^+ \psi (\gamma_1, q_1) \cdots (\gamma_n, q_n)\, q\, \beta$ for some $\psi$, $\beta$

The D2SM starts in a special start state, $q_{START}$, and accepts the input if it
gets to the configuration $q_{ACCEPT}$. □
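
As an informal illustration of the definition, the following Python sketch models a configuration $\psi\, q\, \alpha$ as an L-stack, a state and an R-stack, and implements the read and reduce moves. The class name, the function names and the toy state names (`q0`, `q1`) are assumptions made for this sketch; they do not come from [Turnbull 75], and the control that chooses which move to make is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    """A two-stack machine configuration psi q alpha (illustrative only)."""
    l_stack: list = field(default_factory=list)  # psi: (symbol, state) pairs, top at the right
    state: str = "q_start"
    r_stack: list = field(default_factory=list)  # alpha: symbols, top at the left (index 0)


def read(cfg, new_state):
    """read: q gamma alpha |- (gamma, q) q' alpha."""
    gamma = cfg.r_stack[0]
    return Config(cfg.l_stack + [(gamma, cfg.state)], new_state, cfg.r_stack[1:])


def reduce(cfg, n, sigma):
    """reduce: pop n >= 1 pairs from the L-stack, return to the state stored
    in the first popped pair, and push sigma onto the R-stack."""
    keep = len(cfg.l_stack) - n
    popped = cfg.l_stack[keep:]
    return Config(cfg.l_stack[:keep], popped[0][1], list(sigma) + cfg.r_stack)


# Toy run: read 'c' in state q0, then reduce it to the nonterminal A.
start = Config([], "q0", ["c", "a", "$"])
after_read = read(start, "q1")
after_reduce = reduce(after_read, 1, ["A"])
```

After the reduce, the machine is back in `q0` with `A` on top of the R-stack, which is how a reduction hands a nonterminal back to the control to be read in turn.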

Definition  [Turnbull  75].  A  D2SM  with  the  error  property  is  a  D2SM  where 
the  following  holds: 

• If $q_{START}\, w \vdash^* \psi\, q\, \alpha$ for some $w \in V_R^*$ then for each rule applicable
to $\psi\, q\, \alpha$ there must be a possible R-stack string, $\alpha'$, such
that $\psi\, q\, \alpha' \vdash^* q_{ACCEPT}$ and the first rule used in that sequence
is the rule in question.

•  Removing  the  lookahead  from  the  rules  does  not  affect  the 
language  accepted  by  the  parser  (removing  the  lookahead  will  in 
general  make  the  parser  nondeterministic). 

□ 

In  the  case  of  context-free  grammars,  D2SMs  have  the  same  recognizing  power  as 
LR(k)  parsers. 

Theorem [follows from theorems in Turnbull 75]. A CFG with no useless
nonterminals is LR(k) if and only if it is recognized by a D2SM with the
error property. □

The  SLR(k)  and  LALR(k)  parsers  are  D2SMs.  They  don’t  have  the  error  property 
because  they  may  do  some  reductions  after  an  error  could  have  been  detected  by  the 
corresponding  LR(k)  parser;  however  they  almost  have  the  error  property  because 
they  do  not  read  past  the  error  point. 

In  the  next  chapter  a  method  of  generating  D2SM’s  with  the  error  property  from 
some  TCGs  will  be  presented.  The  above  discussion  has  indicated  that  such  parsers  will 
have  desirable  time  efficiency  and  error  properties. 
3.3 Some TCG Advantages


The main advantage claimed for the TCGs is more power for the space. It is possible to
generate TCGs for languages which cannot be handled by LALR(1) grammars, yet which
can be deterministically parsed by a machine without the large number of states that
an LR(1) parser might have. Similarly, it is possible to handle lookahead of more than
one symbol without a huge increase in the number of states.

Example. Given the grammar $G_1$:

$Z \to S \perp$    $\Delta_0$
$S \to Aa$    $\Delta_1$
$S \to dAb$    $\Delta_2$
$S \to Bb$    $\Delta_3$
$S \to dBa$    $\Delta_4$
$A \to c$    $\Delta_5$
$A \to a$    $\Delta_6$
$B \to c$    $\Delta_7$

Figure 3.2 shows the LR(0) machine for $G_1$, with LALR(1) lookahead sets.

Figure 3.2. LR(0) machine for $G_1$ with LALR(1) lookaheads


The lookahead transitions marked $\{a, b\}$ are nondeterministic, so the grammar is not
LALR(1). Consider the TCG formed by replacing productions 5 and 7 with the following:

$Aa \to ca$    $\Delta_{5a}$
$Ab \to cb$    $\Delta_{5b}$
$Ba \to ca$    $\Delta_{7a}$
$Bb \to cb$    $\Delta_{7b}$

This TCG, $G_1'$, generates exactly the same language as $G_1$. Figure 3.3 shows the control
of a D2SM with the error property to parse $G_1'$.

Figure 3.3. Parser for $G_1'$


It can be shown that the above parser is a D2SM with the error property. Thus, here is
a case where a parser can be built using TCGs when the LALR(1) technique does not
work. It is true that an LALR(1) grammar for the language exists, but it requires
assigning different purposes to the nonterminals in the original grammar (i.e., the
strings produced by nonterminals in the new grammar would be different from the
strings produced by them in the original grammar). This may not be so convenient for
the purpose for which the grammar is to be used. On the other hand, the TCG
is the “same” as the original grammar in the sense just given.
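
The claim that the TCG generates the same language as the original grammar can be checked mechanically for this small example. The sketch below is not part of the thesis method: it simply enumerates derivations by string rewriting. It assumes the grammar is read as $S \to Aa \mid dAb \mid Bb \mid dBa$, $A \to c \mid a$, $B \to c$, writes the end marker as `$`, and uses a hypothetical helper name `language`.

```python
def language(productions, start, nonterminals, max_len=8):
    """Enumerate the terminal strings derivable from `start`.

    Each production is a (lhs, rhs) pair of strings. A terminal context
    production 'A x -> alpha x' is written with x on both sides, so it can
    only be applied where A is actually followed by x.
    """
    seen, todo, result = {start}, [start], set()
    while todo:
        form = todo.pop()
        if not any(ch in nonterminals for ch in form):
            result.add(form)
            continue
        for lhs, rhs in productions:
            pos = form.find(lhs)
            while pos != -1:
                new = form[:pos] + rhs + form[pos + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    todo.append(new)
                pos = form.find(lhs, pos + 1)
    return result


# Original context-free grammar ('$' stands for the end marker).
G1 = [("Z", "S$"), ("S", "Aa"), ("S", "dAb"), ("S", "Bb"), ("S", "dBa"),
      ("A", "c"), ("A", "a"), ("B", "c")]

# TCG: productions A -> c and B -> c replaced by versions with right context.
G1_TCG = [p for p in G1 if p not in (("A", "c"), ("B", "c"))] + [
    ("Aa", "ca"), ("Ab", "cb"), ("Ba", "ca"), ("Bb", "cb")]

lang_cfg = language(G1, "Z", "ZSAB")
lang_tcg = language(G1_TCG, "Z", "ZSAB")
```

Both grammars yield the same six sentences, and each nonterminal still derives the same strings as before; only the point at which `c` may be reduced has been tied to the terminal that follows it.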

An LR(1) parser for the original grammar would be almost the same as the
parser in figure 3.3. The only differences would be that the LR(1) parser would have
different states for the two $\Delta_6$’s, and the productions 5 and 7 would be applied after
looking at the following symbols rather than reading them. The second difference is
not very important because the actions are very similar. The first difference points to
the advantage of TCGs: if the grammar were only part of a larger grammar, the LR(1)
technique causes all sorts of split states where the splitting may not be needed. The
TCGs can cause a sort of automatic state splitting in a limited number of places, leaving
most of the parser the same as the LR(0) machine.

Sometimes the “natural” grammar for a language is not even LR(1). Here is an
example from [Hunter, McGettrick and Patel 77], where the published grammar for an
ALGOL 68 fragment was LR(2).

Example. In ALGOL 68 the following structure definition is allowed:
struct ( real x,y, int i,j)

Here is $G_2$, an obvious grammar for such strings.

STRAD -> ‘struct’ FPACK
FPACK -> ( FIELDS )
FIELDS -> FIELDS , FIELD
FIELDS -> FIELD
FIELD -> ‘ad’ LIST
LIST -> ‘tag’
LIST -> LIST , ‘tag’

$G_2$ is not LR(1), because there is a state like:

FIELD -> ‘ad’ LIST •
LIST -> LIST • , ‘tag’

Due  to  the  overloading  of  the  comma,  it  is  legitimate  for  a  comma  to  follow  a  FIELD  in 
this  state.  The  only  way  to  tell  which  action  is  to  be  done  in  the  above  state  is  to  see 
whether  the  symbol  following  the  comma  is  a  tag  or  an  ad. 
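
The two-symbol decision described above can be written down directly. The following sketch is only an illustration of why one symbol of lookahead is not enough in this state; the function name and the action labels are assumptions of the sketch, not the tables of any parser generator.

```python
def action_after_list(lookahead):
    """Choose the action in the state

        FIELD -> 'ad' LIST .
        LIST  -> LIST . , 'tag'

    from (up to) the next two tokens of the input.
    """
    first = lookahead[0] if len(lookahead) > 0 else None
    second = lookahead[1] if len(lookahead) > 1 else None
    if first == ")":
        return "reduce FIELD"      # end of the field pack
    if first == "," and second == "tag":
        return "shift"             # the comma continues the LIST
    if first == "," and second == "ad":
        return "reduce FIELD"      # the comma separates two FIELDs
    return "error"


decisions = [action_after_list(la) for la in ([",", "tag"], [",", "ad"], [")"])]
```

With only the first token available, the two comma cases are indistinguishable, which is exactly the LR(1) conflict.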

The following TCG, $G_2'$, generates the same language as $G_2$. The symbols have been
abbreviated, letting S=STRAD, B=FPACK, C=FIELDS, D=FIELD, E=LIST, a=‘struct’,
b=‘ad’, c=‘tag’.

$S \to aB$    $\Delta_1$
$B \to (C)$    $\Delta_2$
$C \to C,D$    $\Delta_3$
$C \to D$    $\Delta_4$
$D) \to bE)$    $\Delta_{5)}$
$D,b \to bE,b$    $\Delta_{5,b}$
$E \to c$    $\Delta_6$
$E \to E,c$    $\Delta_7$
Figure 3.4 shows a parser for $G_2'$.

Figure 3.4. Parser for $G_2'$
It is true that an LALR(1) grammar exists for the language, but why not use a technique
that allows the use of a natural grammar? It could be argued that compiler writers
have gotten around any problems with LALR(1) parsing, and thus that the TCG technique
is the solution for a non-problem. However, it is quite possible that people are so
used to saying “lookahead greater than one symbol is impractical” that some useful
techniques may have been overlooked.



4.  THE  GENERATION  OF  TCGS  AND  THEIR  PARSERS 


4.1  A  Parser  Generator 

In  this  section  a  method  will  be  presented  for  generating  the  finite  state  control  for  a 
D2SM  to  parse  a  given  TCG.  The  parser  generator  will  be  very  similar  to  the  LR(0) 
parser  generator  and  to  the  parser  generator  presented  by  Turnbull  [Turnbull  75,  ch. 
6].  The  LR(0)  parser  generator  cannot  be  used  directly  because  it  is  defined  for  CFGs; 
Turnbull’s  parser  generator  can  be  improved  for  use  on  TCGs. 

Notation. Call the LR(k) parser generator $PGEN_{LR(k)}$ and Turnbull’s
parser generator $PGEN_{DRP(m)}$. The parser generator to be developed in
this thesis will be called simply PGEN. The finite state controls constructed
by the three methods on a grammar $G$ will be referred to as
$M_{LR(k)}(G)$, $M_{DRP(m)}(G)$, and $M(G)$, respectively. □

All the parser generators to be discussed in this section have essentially the same
form. They build up states made of sets of items, like $(Ax \to \alpha_1 \bullet X \alpha_2,\ y)$. The quantity
$Ax \to \alpha_1 \bullet X \alpha_2$ is a marked production, in which $X$ is the marked symbol and $y$ is a context
string. Such an item is present in a state when the $X$ of $Ax \to \alpha_1 \bullet X \alpha_2$ could be read, and $y$
could occur after the right side. If $y = \lambda$ the item may be written as simply $(Ax \to \alpha_1 \bullet X \alpha_2)$. A
state $q$ starts out with a kernel of items, $\hat{Q}(q)$, and the complete state set is defined as
the smallest $Q(q)$ satisfying:

$Q(q) = \hat{Q}(q) \cup \{\, i \mid i \text{ is an item in } C(j),\ j \in Q(q) \,\}$

At the heart of the parser generators is the closure function, $C$. It is meant to find
items which should appear in a state when the argument item is in that state. The closure
functions are somewhat different in each of the three parser generators.

The parser generators start with the initial kernel

$\{\, (Z \to \bullet S \perp^k) \mid S \text{ is the goal symbol} \,\}$

In the above, $k$ is a construction parameter, the length of the maximum context on any
production of the k-TCG to be parsed. The grammar is assumed to have been
augmented with $k$ $\perp$’s. Turnbull uses the symbol $m$ to represent the analogous
construction parameter in $PGEN_{DRP(m)}$.

The read transitions from a given state are calculated in the usual way. For each
$Y \in N \cup T$, form the kernel

$\hat{Q}(q') = \{\, (Ax \to \alpha_1 Y \bullet \alpha_2,\ w) \mid (Ax \to \alpha_1 \bullet Y \alpha_2,\ w) \in Q(q) \,\}$

If this is nonempty then there is a read transition for $Y$ into the state which is the
closure of $\hat{Q}(q')$.

After  a  new  state  is  completed,  it  is  compared  with  previously  formed  states.  If 
there  is  one  which  is  exactly  the  same  as  the  new  state  then  the  new  state  is  not 
needed  —  the  read  transition  should  go  to  that  old  state. 
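
For the purely context-free case, the state-building loop just described (close a kernel, form read transitions, reuse identical states) can be sketched as follows. This is the textbook LR(0) construction applied to the grammar $S \to Aa \mid dAb \mid Bb \mid dBa$, $A \to c \mid a$, $B \to c$ of section 3.3, not PGEN itself: context strings and terminal context are omitted, and the function names and the `$` end marker are assumptions of the sketch.

```python
def closure(items, grammar):
    """LR(0) closure; an item is (lhs, rhs, dot) with rhs a tuple of symbols."""
    result, todo = set(items), list(items)
    while todo:
        _, rhs, dot = todo.pop()
        if dot < len(rhs):
            for lhs2, rhs2 in grammar:
                if lhs2 == rhs[dot]:
                    item = (lhs2, rhs2, 0)
                    if item not in result:
                        result.add(item)
                        todo.append(item)
    return frozenset(result)


def build_states(grammar, start_item):
    """Return the list of states and the read transitions between them."""
    start = closure([start_item], grammar)
    states, todo, edges = [start], [start], {}
    while todo:
        state = todo.pop()
        symbols = {rhs[dot] for _, rhs, dot in state if dot < len(rhs)}
        for y in sorted(symbols):
            kernel = [(l, r, d + 1) for l, r, d in state
                      if d < len(r) and r[d] == y]
            target = closure(kernel, grammar)
            if target not in states:      # reuse an identical old state
                states.append(target)
                todo.append(target)
            edges[(states.index(state), y)] = states.index(target)
    return states, edges


G1 = [("Z", ("S", "$")), ("S", ("A", "a")), ("S", ("d", "A", "b")),
      ("S", ("B", "b")), ("S", ("d", "B", "a")),
      ("A", ("c",)), ("A", ("a",)), ("B", ("c",))]
states, edges = build_states(G1, ("Z", ("S", "$"), 0))
conflict = [s for s in states
            if ("A", ("c",), 1) in s and ("B", ("c",), 1) in s]
```

The single state found in `conflict` holds both completed items for `c`; it is the state whose two reductions cannot be separated by LALR(1) lookahead.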

Those items whose mark is at the end of a production have not yet been considered.
In $PGEN_{LR(k)}$ and $PGEN_{DRP(m)}$ these are handled by having a reduce (also known as
apply) state accessed by a lookahead transition which is found using the context
strings. PGEN will not use lookahead at all. Instead, the grammars will be written with
terminal context sufficient to ensure that all reduce states are accessed by read
transitions. If there are any states which contain a transition to a reduce state and any
other action at all, then the machine is nondeterministic.

The point of difference between the three parser generators is the closure function.
With a context-free grammar it is evident which items should be in a closure. If a state
contains an item $(A \to \alpha \bullet B \beta)$ then any string derivable from $B$ is valid at that point, so all
items like $(B \to \bullet \gamma)$ are in the closure. On the other hand, with a type-0 or context-sensitive
grammar, productions whose left-hand sides start with the marked symbol may or
may not generate items for the closure depending on strings which can validly follow
the mark. A closure function which works sometimes is

$C(A \to \beta_1 \bullet B \beta_2,\ w) = \{\, (B\gamma \to \bullet \delta,\ z) \mid \gamma z \in PREFIX(L(\beta_2 w)),$

$z \in T^{k}$ for $PGEN_{LR(k)}$,
$z \in V^{\le m}$ for $PGEN_{DRP(m)}$,
$z = \lambda$ for PGEN $\}$

(For $PGEN_{DRP(m)}$ the $L$ function must return strings in $V^*$, not just $T^*$.)

This is essentially the closure function used in all three parser generators. It works
fine for CFGs but there are two problems with it when the input grammar is not
context-free.

a) When the grammar is a TOG it is undecidable in general which strings are in
$L(\beta_2 w)$. Since $PGEN_{DRP(m)}$ handles some TOGs, it needs to use an algorithm
which only works sometimes.

b) There may be items which should be in the closure if the language is to be
parsed correctly, but are not included because $\beta_2 w$ is insufficient context to tell
whether or not $\gamma$ can be seen.

For an example of problem (b) consider the following: suppose one is calculating
$C(E \to \bullet F)$. If the grammar contains a production $Fa \to ba$, it is uncertain whether
$(Fa \to \bullet ba)$ should be in the closure. It depends on whether $a$ can appear after a
reduction to $E$ is made from this state.

$PGEN_{DRP(m)}$ tries to calculate the closure function just given as follows. To find all
elements in $PREFIX_m(L(\beta_2 w))$, all possible derivations starting from $\beta_2 w$ are tried,
chopping off the results at length $m$. This is repeated on the resulting strings until
there is no change in the so-called $H_m$ set.

Sometimes during this procedure a derivation can only be made ‘conditionally’.
This occurs when there is not enough context in a string to tell whether or not a given
production can be used. This may be related to problem (a): the type-0 productions
may cause a length-$m$ string to be derivable only from a longer string. Or, problem (b)
may be at fault: the original string $\beta_2 w$ may have insufficient context. $PGEN_{DRP(m)}$
gives tags to elements of the $H_m$ set, with all symbols resulting from a conditional
derivation marked with ‘1’s in the corresponding tags.

An item $(B\gamma \to \bullet \delta,\ z)$ can be added to the closure of $(A \to \beta_1 \bullet B \beta_2,\ w)$ if $\gamma z \in H_m(\beta_2 w)$ and
all of $\gamma$ is in the unconditional part of a string in $H_m(\beta_2 w)$. However, if $\gamma$ is there only
conditionally then such an item cannot be added if the resulting parser is to have the
error property. $PGEN_{DRP(m)}$ puts such items in a conditional state set. An item can be
removed from the conditional state set if a corresponding unconditional item is added
to the closure. At the end of the closure the conditional state set must be empty. If it
isn’t, $PGEN_{DRP(m)}$ aborts: either problem (a) or problem (b) has been encountered.

The closure function to be used for PGEN is almost a special case of the one just
given:

$C(Ax \to \beta_1 \bullet B \beta_2) = \{\, (By \to \bullet \delta y) \mid y \in PREFIX(L(\beta_2)),\ x, y \in T^*,\ A, B \in N,\ \beta_1, \beta_2, \delta \in V^* \,\}$

No context strings are used because any context needed will be added to the grammar.
Also, the appropriate $H_m$ sets need only contain strings of terminals. The closure
function is different from that of Turnbull’s in the calculation of the $H_m$ sets: a new method
will be presented in the next section which eliminates problem (a). This is possible
because for TCGs one can find exactly the set $FIRST_m(L(\beta))$ if the appropriate method
is used. Problem (b) can’t be avoided, so PGEN can abort too. Either the grammar
must be changed or the error property must be sacrificed. Both of these possibilities
will be discussed later.

4.2 Calculating $H_m$

Due to the inadequacy of his method for calculating $H_m$, there are some simple
1-TCGs for which Turnbull’s parser generator stops with the ‘conditional state set not
empty’ condition, for large values of $m$. In fact, given $m$, a grammar can be
constructed for which this occurs:

$Ac \to B C_1 C_2 \cdots C_m c$
$C_1 c \to c$; $C_2 c \to c$; $\ldots$; $C_m c \to c$
$Bc \to bc$

Here, $H_m(C_1 C_2 \cdots C_m)$ will have to be calculated when trying to close $(Ac \to \bullet B C_1 C_2 \cdots C_m c)$.
The $H_m$ set will include $c$. Turnbull associates a ‘tag’ of ‘1’ with this because $C_m c \to c$
cannot be applied for sure without another letter of context.

While  it  is  true  that  given  any  TCG,  a  big  enough  m  will  suffice  to  avoid  this  problem, 
such  a  solution  is  not  very  practical  because  the  trying  of  all  possible  derivations  for 
even  moderate  m  can  take  a  very  long  time. 

What  is  required  is  some  way  of  calculating  Hm  which  works  from  the  right  of  the 
various  rules,  where  context  is  known. 

The $H_m$ function is used to decide which productions might be applicable during a
canonical parse of a sentence in the language. For a type-0 grammar it is possible for
nonterminals to follow the parse point, even in a canonical parse. For this reason Turnbull
included nonterminals in the $H_m$ sets. However, with TCGs, only terminals can follow
the parse point in a canonical parse. Therefore, only terminal strings will be
included in the $H_m$ sets.

Definition. $H_m(\beta)$ is defined to be the set $FIRST_m(L(\beta))$. □

$H_m(\beta)$ contains strings of terminals of length $\le m$ that can begin strings derived from
$\beta$. Recall that $L(\beta)$ includes only those strings that can be derived from $\beta$ without additional
context. Also recall that $FIRST_m(L(\beta))$ includes strings of length $< m$ only if
those strings are actually in $L(\beta)$ (not just prefixes).
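
The distinction between complete short strings and mere prefixes can be seen on a finite language. The helper below is a minimal sketch (the name `first_m` and the sample languages are illustrative assumptions): truncating every sentence to $m$ symbols yields a string shorter than $m$ only when that string is itself a sentence.

```python
def first_m(language, m):
    """FIRST_m of a finite set of terminal strings.

    Every sentence is cut to its first m symbols, so a result shorter
    than m is always a complete sentence, never a proper prefix.
    """
    return {s[:m] for s in language}


with_short_sentence = first_m({"abc", "abd", "a"}, 2)
without_short_sentence = first_m({"abc", "abd"}, 2)
```

Here `"a"` appears in the first result because it is a sentence of the language, and is absent from the second even though it is a prefix of both sentences.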

Suppose the productions of a TCG are

$P = \{\, A_i x_i \to \alpha_i x_i \mid A_i \in N,\ x_i \in T^*,\ 1 \le i \le n \,\}$

The following sets can be used to calculate $H_m$.

$S_i(w) = H_m(\alpha_i x_i w)$, $1 \le i \le n$, $w \in T^{\le m}$

The set $S_i(w)$ is simply the set of terminal strings of length $m$ or less that can be a
prefix of something derived from the right side of production $i$ when it is followed by
the string $w$. This is calculated for each $w$ of the length required to make the number
of terminals at the end of $\alpha_i x_i w$ at least $m$.

Then $H_m(X_1 X_2 \ldots X_s)$ can be calculated for an arbitrary $X_1 \ldots X_s \in V^*$ using these sets
and working from the right. If an $X_i$ is a terminal then that terminal will always be in
that position. If it is a nonterminal, say $A_i$, then one can discover which of the sets
$S_i(w)$ can be substituted at that position by examining the sets of strings generated
from $X_{i+1} \ldots X_s$. This will be made more exact after the $S_i$-calculating algorithm has
been presented.

The sets $S_i(w)$ can be calculated by the successive approximation method given in
the algorithm $S_i$-CALC, shown in figure 4.1.




$S_i$-CALC: procedure(G: TCG):
    (Calculate $S_i$ sets for grammar G)
    H: function(E: set of strings) returns set of strings:
        (Calculate current approximation to $H_m$ function)
        $H' := \emptyset$
        for each $\alpha \in E$ do:
            if $\alpha \in T^*$ then $H' := H' \cup FIRST_m(\alpha)$
            (Suppose $\alpha = A a_1 a_2 \cdots a_t$, $A \in N$, $a_i \in T$, $0 \le t \le m$.
            All other arguments to H are of this form.)
            for $j := 0$ to $t$ do:
                for each $i$ and each $w \in T^{\le m}$
                        such that $A_i x_i w = A a_1 a_2 \cdots a_j$ do:
                    $H' := H' \cup FIRST_m(S_i(w) \cdot a_{j+1} a_{j+2} \cdots a_t)$
                    (If $j = 0$ then $a_1 \cdots a_j = \lambda$; if $j = t$ then $a_{j+1} \cdots a_t = \lambda$)
        return $H'$
    end H

    for $i := 1$ to $n$ do:
        for all $w \in T^{\le m - |x_i|}$ do:
            if $\alpha_i x_i = \lambda$ then $S_i(w) := \{w\}$
            else $S_i(w) := \emptyset$

    until no change in any $S_i(w)$ do:
        for $i := 1$ to $n$, $\alpha_i x_i \ne \lambda$ do:
            for all $w \in T^{\le m - |x_i|}$ do:
                (Suppose $\alpha_i x_i w = X_1 X_2 \cdots X_r$, $X_j \in N \cup T$)
                $S_i(w) := S_i(w) \cup H(X_1 \cdot H(X_2 \cdot H(X_3 \cdots H(X_r))))$

end $S_i$-CALC

Figure 4.1. Calculating $S_i$ sets

The procedure given clearly terminates because only elements of $T^{\le m}$ are added to the
sets $S_i$, and $T^{\le m}$ is finite. The procedure only continues as long as there are still
elements of $T^{\le m}$ to be added.


Theorem. When the $S_i$-CALC procedure terminates, the following is true:

1) $S_i(w) = FIRST_m(L(\alpha_i x_i w))$, $1 \le i \le n$

2) The H function, using the final values of $S_i(w)$, correctly calculates
$H_m(\beta)$ for $\beta \in N$

Proof. The proof is straightforward, but tedious. It is given in appendix A.

In practice, one needn’t calculate $S_i(w)$ for all $w \in W_i = T^{\le m - |x_i|}$. The number of such
sets grows very rapidly as $m$ increases. The reason that it has to be done for some
$w \ne \lambda$ is that $L(\alpha_i x_i w)$ need not equal $L(\alpha_i x_i) \cdot w$. For example, if $\alpha_i$ ends in $A_j$ and
there is a production $A_j x_j \to \alpha_j x_j$, with $|x_i| < |x_j|$, then that production cannot be
substituted when calculating $L(\alpha_i x_i)$ because there isn’t enough context. Yet it might be
substituted when calculating $L(\alpha_i x_i w)$. Therefore one cannot just substitute $L(\alpha_i x_i)$
when trying to see what can be generated from $A_i x_i$ in the context of another
production.

From this argument, it can be seen that the only non-null $w$’s for which $S_i(w)$ need
be calculated are:

$W_i' = \{\, w \mid w \in T^{\le l}$, where $l$ = the longest length of a part
of context needed by a nonterminal in $\alpha_i$
which is not provided in the rhs $\}$

The $l$ in the above can be calculated by knowing the length of the shortest terminal
string derivable from each nonterminal, and the length of context needed for any rule
for each nonterminal. To allow precalculation of these values, they can be calculated
on the assumption that any required context can be found in rule $i$. This assumption
may be false, in which case a larger $l$ than necessary may be calculated. In spite of
this, $W_i'$ should be much smaller than $W_i$ if most rules do not have any right context.

Another practical saving can be accomplished by keeping track of only those
elements of $S_i(wa)$ which are not just $S_i(w) \cdot a$.

It will turn out that other considerations will have the pleasant consequence that $l$
can usually be zero for all productions. This removes several inner loops from $S_i$-CALC
and reduces space requirements drastically. The more general algorithm has been
given because there is one circumstance where it must be used.

Example. Consider the 2-TCG:

1: $A \to aAa$
2: $A \to BaC$
3: $C \to \lambda$
4: $C \to cC$
5: $Bb \to b$
6: $Baa \to Bbaa$
7: $Bb \to Ab$

Notice that $L(BaC) \cdot a \ne L(BaCa)$, because $BaC \Rightarrow Ba$ but production 6 cannot be used
unless it is known that the next character is an $a$. This will be true in the context of
production 1, but not production 7. The algorithm goes through the following steps
(only the changed sets are shown after each loop):

Initially:
$S_1 = S_2 = S_2(a) = S_2(b) = S_2(c) = S_4 = S_5 = S_6 = S_7 = \emptyset$
$S_3 = \{\lambda\}$

After Loop 1:
$S_6 = \{ba\}$

After Loop 2:
$S_2(a) = \{ba\}$
$S_4 = \{c, cc\}$

After Loop 3:
$S_1 = \{ab\}$
$S_7 = \{ab\}$

After Loop 4:
$S_1 = \{ab, aa\}$
$S_6 = \{ba, ab\}$
$S_7 = \{ab, aa\}$

After Loop 5:
$S_2(a) = \{ba, ab\}$
$S_6 = \{ba, ab, aa\}$

After Loop 6:
$S_2(a) = \{ba, ab, aa\}$

□
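
The final sets of this example can be reproduced with a short program. The following Python sketch follows the successive-approximation idea of $S_i$-CALC under a few simplifying assumptions that are not in the figure: every symbol is a single character, $S_i(w)$ is kept for every $w$ with $|x_i w| \le m$ rather than only for the reduced set $W_i'$, and all sets start empty (the first pass then seeds the productions with empty $\alpha_i$). The names `si_calc`, `prods` and `S` are hypothetical.

```python
from itertools import product


def si_calc(prods, nonterminals, terminals, m):
    """prods maps a production number i to (A, x, alpha), read as A x -> alpha x.
    Returns S with S[(i, w)] approximating FIRST_m(L(alpha_i x_i w)) exactly
    once the fixed point is reached."""
    def strings_up_to(n):
        return ["".join(p) for k in range(n + 1)
                for p in product(terminals, repeat=k)]

    S = {(i, w): set() for i, (_, x, _) in prods.items()
         for w in strings_up_to(max(0, m - len(x)))}

    def H(E):
        """Each string in E is all terminals, or one nonterminal followed
        by terminals."""
        out = set()
        for s in E:
            if not s or s[0] not in nonterminals:
                out.add(s[:m])
                continue
            A, tail = s[0], s[1:]
            for j in range(len(tail) + 1):
                ctx, rest = tail[:j], tail[j:]
                for i, (Ai, x, _) in prods.items():
                    # production i applies if its context x is present
                    if Ai == A and ctx.startswith(x):
                        for u in S.get((i, ctx[len(x):]), ()):
                            out.add((u + rest)[:m])
        return out

    changed = True
    while changed:
        changed = False
        for (i, w), current in S.items():
            _, x, alpha = prods[i]
            approx = {(x + w)[:m]}          # terminal suffix x_i w
            for X in reversed(alpha):       # work from the right
                approx = H({X + s for s in approx})
            if not approx <= current:
                current |= approx
                changed = True
    return S


prods = {1: ("A", "", "aAa"), 2: ("A", "", "BaC"), 3: ("C", "", ""),
         4: ("C", "", "cC"), 5: ("B", "b", ""), 6: ("B", "aa", "Bb"),
         7: ("B", "b", "A")}
S = si_calc(prods, "ABC", "abc", 2)
```

The order in which sets change depends on how the loops are scheduled, so intermediate values need not match the trace above step for step; only the fixed point is determined. In this run `S[(2, "")]` stays empty while `S[(2, "a")]` does not, which is the difference between $L(BaC) \cdot a$ and $L(BaCa)$ noted in the example.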


Once the $S_i$ sets have been calculated, it is possible to calculate $H_m(\beta)$ for any
$\beta = X_1 \ldots X_s$, $X_i \in N \cup T$, by the formula

$H_m(X_1 \ldots X_s) = H(X_1 \cdot H(X_2 \cdot H(\ldots \cdot H(X_s) \ldots)))$

(using H as in the algorithm, with the final $S_i$’s).

The critical property of this new method for calculating $H_m$ is that it gives
$H_m(\beta) = FIRST_m(L(\beta))$, whereas Turnbull’s method may have been unsure about
whether a string should be in the set or not. With a k-TCG it is always possible to tell
whether or not a given production can be used if all strings of length $k$ that can appear
in any given context are known. If context strings of length $k$ were carried along with
each item, this would always be possible. PGEN does not carry along such context
strings because they cause indiscriminate state splitting, which is something the TCG
method is supposed to avoid. The view is taken that any needed context should be
added to the grammar.

So, while the method just given will always calculate $H_m$ correctly (i.e., it will find
those prefixes of strings that can be derived from $\beta$ for sure), problem (b) mentioned
above can still occur when it is used in PGEN. In such cases, PGEN must abort.

Definition. $M(G)$ is said to exist if PGEN processes grammar $G$ without
aborting. It is deterministic if there are no states containing an item with
the mark at the end as well as containing some other items. □

In $PGEN_{DRP(m)}$, the $m$ of $H_m$ is the same as the maximum length of a context string
in an item. If TCGs are being processed then the method of this thesis could be used to
calculate $H_m$, and the two values of $m$ needn’t be the same. If this is done and the
maximum context string length is set to zero, then the resulting algorithm is identical
to PGEN.

This is an advantage when it comes to showing properties of $M(G)$. The most important
property of $M(G)$ is that if it is deterministic then it is the controller for a D2SM
with the error property.

Proposition. If $M(G)$ exists and is deterministic then it is a D2SM with the
error property. Furthermore, as a corollary, $M(G)$ must halt on all
inputs.
Discussion of proposition. Turnbull’s Lemmas 6.1, 6.2, and 6.3 are all
statements about the properties of controllers built by his parser generator.
Basically, they show that any item in any state is “essential” in the
sense that it can be used in the parse of some sentence in $L(G)$. Examining
the proofs of these lemmas shows that they could easily be
rewritten to hold equally well for the $M(G)$ method. Informally, this is
because $M(G)$ is like his machine with length zero context strings. There
does not seem to be much point in presenting the rewritten lemmas
required to make this more formal.

From the lemmas, Turnbull’s theorems 6.1 and 6.2 are quite straightforward.
The proposition is just a rewording of those theorems to apply to
TCGs. □

Thus, when $M(G)$ is deterministic it has all the attractive properties enumerated in
the last chapter.

Another important consequence of the theorem is that Turnbull’s grammatical
characterization applies. His theorem 5.7 [Turnbull 75, p. 5.26] says that any language
accepted by a D2SM with the error property is what he calls deterministic regular
parsable (DRP) [see Turnbull 75, p. 5.9]. The full generality of the DRP property is not
needed here, only the property as it applies to machines without lookahead and to
TCGs. This will be known as the TDRP property.

Definition. A TCG is terminal context deterministic regular parsable
(TDRP) if the following holds. Let

$CS(G) = \{\, \psi\Delta_p \mid Z \Rightarrow^* \alpha B y z \Rightarrow \alpha\beta y z,\ \alpha\beta y = \psi,$
the last production used was the $p$th production, $By \to \beta y$,
$Z$ is the goal symbol $\}$

$CS(G)$ is the set of characteristic strings for $G$. Whenever $\psi\Delta_p \in CS(G)$ is
recognized then the production $p$ can be applied. The TDRP condition is:

i) If $\psi\Delta_p \in CS(G)$ then there cannot also be a $\psi y \Delta_s \in CS(G)$ with
$y \in V^*$, $s \ne p$.

ii) $CS(G)$ is regular. [It turns out that this is always true for TCGs, but
this fact will not be needed. This part of the condition is included to
parallel the DRP definition.]

A language is TDRP if there is a TDRP grammar for it. □

This  definition  is  obtained  from  Turnbull’s  DRP  definition  by  setting  his  k  to  0  (i.e., 
no  lookahead)  and  using  TCGs  instead  of  arbitrary  TOGs.  Because  this  characterizes 
the  D2SMs  built  by  PGEN,  the  following  important  theorem  is  immediate. 

Theorem. If M(G) exists then M(G) deterministically accepts L(G) if and
only if G is TDRP.

Proof. This follows from the relation between TDRP and DRP grammars,
Turnbull’s theorems 5.5, 5.6, and 5.7, and the proposition above. □

4.3  Full  TCGs 

So far in this chapter it has been shown how some TCGs can be parsed easily. The
question of obtaining TCGs to parse will be addressed in the remainder of the chapter.

There are not many circumstances where writing terminal context productions
would seem to be a natural method for describing a language. Those circumstances
where CFGs are inadequate for programming language description also seem to be
awkward when TCGs are used. It was shown in chapter three that the formal power of TCGs
is no greater than that of CFGs, and it appears that this is also true for the practical
“ease of writing” power.

In  this  thesis  TCGs  will  be  generated  from  CFGs  in  such  a  way  that  the  grammars  are 
“the  same”  in  the  sense  that  one  can  easily  recognize  the  underlying  CFG  when  looking 
at  the  TCG  generated  from  it.  The  importance  of  such  TCGs  is  that  they  may  lead  to  a 
method  of  parsing  a  sentence  of  a  language  in  the  way  that  the  grammar  writer  wishes 
when  conventional  methods  for  parsing  the  underlying  CFG  are  impractical.  The  main 
tool  that  will  be  used  for  generating  TCGs  from  CFGs  is  the  ARC  function.  Its  definition 
requires  the  use  of  the  FOLLOW  function. 

Definition. Given a grammar, G = <N,T,S,P>, the FOLLOW function is given
by

FOLLOW_k(α) = { y | S ⇒* βαyz, |y| = k, y,z ∈ T*, α,β ∈ V* }.

□


Definition. The add right context function, ARC(prod), is as follows:

ARC(Ax → αx) = { Axa → αxa | a ∈ FOLLOW_1(Ax), a ∈ T }

ARC^k is defined as the k-fold closure of ARC (i.e., the set of productions
obtained by adding k terminals of right context). □

The ARC function can be used to replace a production with an equivalent set of
productions with another symbol of terminal context. Suppose there is an augmented
grammar G = <N,T,S,P>. Form a new grammar, G', by replacing a subset Q of P by
{ ARC(q) | q ∈ Q }. Then the grammars are equivalent in the sense that the two languages
L(G) and L(G') are equal.


Theorem. With G and G' as just described,

S ⇒*_G x  ⟺  S ⇒*_G' x   (x ∈ T*)


Proof.

⇒ Clearly it need only be shown that whenever a production in Q could
be used during a derivation, it must also be true that there is a
production in ARC(Q) which can be used (to the same effect). A canonical
derivation in G can be written:

S ⇒_G β_1 A_1 x_1 ⇒_G β_2 A_2 x_2 ⇒_G ... ⇒_G x_n

(where A_i lies not to the right of where A_{i-1} was).

Now, since G is assumed to be augmented, x_i contains at least ⊥,
so each x_i can be written as a_i x_i', where a_i ∈ T, x_i' ∈ T*. So, by the
definition of FOLLOW_1, a_i ∈ FOLLOW_1(A_i). Therefore, if a production
has been replaced by ARC of it, there will be a new production with a_i
as right context, and it may be used to the same effect.

⇐ The proof in this direction is obvious because L(G') ⊆ L(G), since the
productions of G can be applied everywhere that those of G' can, with
the same effect.

□ 

Recall that a full k-TCG is one in which every production has terminal context of
length exactly k. A particularly interesting full k-TCG is the one formed from a CFG by
replacing all productions by ARC^k on them. Using PGEN on such TCGs yields a parser
intimately related to the LR(k) parser for the corresponding CFG.

The machine M(G') for a full k-TCG is almost the same as an LR(k) parser for the
corresponding CFG. Intuitively, the states are formed the same way: it is only that
the items in M_LR(k)(G) carry along the k-symbol context strings separately whereas the
items of M(G') contain the same strings in the marked productions. Unfortunately, a
number of complications arise when trying to describe the correspondence when k is
greater than one. A lemma has been developed which gives the relationship between
the states of the two parsers, but it is rather messy and doesn’t add much to the
intuitive comment given above. For this reason it has been put into appendix A.

If k = 1 and M_LR(k)(G) is deterministic then M(G') is exactly the same as M_LR(k)(G)
(except that the latter has lookahead transitions to apply states where the former has
read transitions to analogous apply states). This is because the determinism of
M_LR(k)(G) implies that when there is a lookahead transition on a symbol a then there
can be no other transitions for a in that state.

Example. A full 1-TCG, G3, with the following productions:

S' → E ⊥       A0
E → E + T      A1
E → T          A2
T → T * F      A3
T → F          A4
F → ( E )      A5
F → x          A6

To ARC each of these productions, the FOLLOW_1 sets are required:

FOLLOW_1(E) = { +, ), ⊥ }

FOLLOW_1(T) = { +, *, ), ⊥ }

FOLLOW_1(F) = { +, *, ), ⊥ }

The new productions will be labelled with the ‘old’ production number and the right
context character as a subscript (e.g., 1+ is the production E + → E + T +  A1+).
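
The FOLLOW_1 sets and the labelled ARC productions for this example can be reproduced with a short script. This is only a minimal sketch, not the thesis's PGEN: the names `G3`, `first_sets`, `follow_sets`, `arc`, `FOLLOW` and `ARCED` are assumptions made for illustration, the fixed-point FOLLOW computation is the standard textbook one, and it assumes a grammar with no empty right sides. Because nothing follows S', the goal production receives no context here.

```python
# Grammar G3; "⊥" is the goalpost. List indices are the production numbers.
G3 = [
    ("S'", ["E", "⊥"]),
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["T", "*", "F"]),
    ("T", ["F"]),
    ("F", ["(", "E", ")"]),
    ("F", ["x"]),
]
NONTERMINALS = {lhs for lhs, _ in G3}


def first_of(symbol, first):
    """FIRST_1 of a single symbol (a terminal is its own FIRST set)."""
    return first[symbol] if symbol in NONTERMINALS else {symbol}


def first_sets(grammar):
    """FIRST_1 of every nonterminal; assumes no empty right sides."""
    first = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            new = first_of(rhs[0], first) - first[lhs]
            if new:
                first[lhs] |= new
                changed = True
    return first


def follow_sets(grammar):
    """FOLLOW_1 of every nonterminal, computed as a fixed point."""
    first = first_sets(grammar)
    follow = {n: set() for n in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            for i, sym in enumerate(rhs):
                if sym not in NONTERMINALS:
                    continue
                # What can follow sym: FIRST of the next symbol, or
                # FOLLOW of the left side when sym is the last symbol.
                if i + 1 < len(rhs):
                    source = first_of(rhs[i + 1], first)
                else:
                    source = follow[lhs]
                new = source - follow[sym]
                if new:
                    follow[sym] |= new
                    changed = True
    return follow


def arc(number, lhs, rhs, follow):
    """One ARC step on context-free production `number`: label -> production."""
    return {
        f"{number}{a}": ((lhs, a), tuple(rhs) + (a,))
        for a in sorted(follow[lhs])
    }


FOLLOW = follow_sets(G3)
ARCED = {}
for number, (lhs, rhs) in enumerate(G3):
    ARCED.update(arc(number, lhs, rhs, FOLLOW))
```

For instance, `ARCED["1+"]` is the pair `(("E", "+"), ("E", "+", "T", "+"))`, i.e. the production E + → E + T + described above.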

The state diagram for M(G3) is given in figure 4.2. There is an exact correspondence
between it and the LR(1) table for this grammar (as shown in [Aho and Ullman 72]).

4.4  Resolving-TCGs 

There would be no point in building the controlling algorithm using the method given in
the previous section unless it can be shown that the full k-TCG is not needed to get the
full power of LR(k) in most cases. In this section the first of several methods of adding
context to selected productions only will be investigated.

Given a CFG, G, one possible algorithm is as follows. Form the LR(0) machine
M_LR(0)(G) in the regular way. That machine may contain inadequate items. An
inadequate item is not the only item in a state, and it has the mark at the end. Suppose that
a state in M_LR(0)(G) contains the inadequate item (A → α·). Then in forming G', replace
production A → α with ARC(A → α).

Now, to avoid the abort problem there must be enough context on each production
so that it can be decided for sure whether or not a given production is in a given
closure. Suppose that a production A → α has been ARC’d. Then ARC all productions

{ B → β_1 A β_2 | β_2 z ⇒* z, some z ∈ T* }

Then it can never be in doubt whether or not Aa → αa is in a given closure. The
productions just ARC’d may cause other productions to be ARC’d.

Definition. Call the resulting grammar, G', the Resolving 1-TCG (1-RTCG)
and write it as R_1G. □

If M(R_1G) isn’t deterministic then the process can be repeated: add (possibly another)
character of context to those productions which are in inadequate items. And again, if
a production Ax → αx has been ARC’d then so must

{ By → β_1 A β_2 y | β_2 yz ⇒* syz and |sy| < |x|, some z ∈ T* }

A series M(R_2G), M(R_3G), ... can be obtained. In practice, there will probably be some
small k (say, 2) after which one would give up. This is because there are CFGs for which
M(R_kG) is nondeterministic for all k, just as there are CFGs which are not LR(k) for any
k.

[Figure 4.2. State diagram for M(G3)]

In fact, it will be shown that those grammars for which M(R_kG) is nondeterministic
are precisely the same grammars for which M_LR(k)(G) is nondeterministic. This is
important because it shows that nothing is lost by using PGEN instead of the LR(k)
method. It has already been shown that there is much to be gained.

Theorem. If G is an LR(k) context-free grammar, then the k-RTCG, R_kG, is
a TDRP grammar. This means that M(R_kG) is deterministic.

Proof. Assume, to the contrary, that G is LR(k) but R_kG is not TDRP.

Then, because the TDRP condition is violated, there must be derivations

Z ⇒* γAxz ⇒ γαxz
and Z ⇒* γ'Byz' ⇒ γ'βyz'

with γαx ∈ PREFIX(γ'βy),

and Ax → αx is different from By → βy.

Because of these derivations, (Ax → αx·) will be an inadequate item in
M(R_kG). Now |x| = k because of the method used in forming R_kG: at each
stage of the sequence M(R_0G), M(R_1G), ..., M(R_{k-1}G), the same inadequacy
would have forced another character of context onto Ax' → αx'.

Now because G is LR(k), the definition of LR(k) implies that (using the
same derivations as above), if

Z ⇒* γAxz ⇒ γαxz, |x| = k,

and Z ⇒* γ'By'z'' ⇒ γ'βy'z'', |y'| = k,

then

either A → α = B → β or γαx ∉ PREFIX(γ'βy').

But this contradicts the assumption that R_kG is not TDRP because that
assumption said

γαx ∈ PREFIX(γ'βy) ⊆ PREFIX(γ'βy')

(noting that |y| ≤ k so y ∈ PREFIX(y')). Hence, the original assumption
must have been wrong.

Therefore, the theorem has been proved. □

Example. Given the grammar G4:

Z → S ⊥       A1
S → T         A2
S → SaT       A3
T → cEd       A4
T → dE        A5
E → a         A6
E → ab        A7
Figure 4.3 shows M_LR(0)(G4).

[Figure 4.3. LR(0) machine for G4]
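
The first step of the resolving method, forming the LR(0) machine and locating inadequate items, can be sketched for G4 as follows. This is an illustrative sketch under stated assumptions rather than the thesis's PGEN: the item representation (production index, mark position) and the names `closure`, `goto`, `lr0_states` and `inadequate_productions` are hypothetical, and the construction is the standard LR(0) one.

```python
# Grammar G4; "⊥" is the goalpost. List indices 0..6 stand for productions 1..7.
G4 = [
    ("Z", ("S", "⊥")),
    ("S", ("T",)),
    ("S", ("S", "a", "T")),
    ("T", ("c", "E", "d")),
    ("T", ("d", "E")),
    ("E", ("a",)),
    ("E", ("a", "b")),
]
NONTERMINALS = {lhs for lhs, _ in G4}


def closure(items):
    """LR(0) closure of a set of items (production index, mark position)."""
    result = set(items)
    work = list(items)
    while work:
        p, dot = work.pop()
        rhs = G4[p][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for q, (lhs, _) in enumerate(G4):
                if lhs == rhs[dot] and (q, 0) not in result:
                    result.add((q, 0))
                    work.append((q, 0))
    return frozenset(result)


def goto(state, symbol):
    """State reached from `state` by reading `symbol`."""
    moved = {(p, dot + 1) for p, dot in state
             if dot < len(G4[p][1]) and G4[p][1][dot] == symbol}
    return closure(moved)


def lr0_states():
    """All states of the LR(0) machine, found by a worklist search."""
    start = closure({(0, 0)})
    states, work = {start}, [start]
    while work:
        state = work.pop()
        symbols = {G4[p][1][dot] for p, dot in state if dot < len(G4[p][1])}
        for symbol in symbols:
            nxt = goto(state, symbol)
            if nxt not in states:
                states.add(nxt)
                work.append(nxt)
    return states


def inadequate_productions(states):
    """Productions having an item with the mark at the end that is
    not the only item in its state."""
    found = set()
    for state in states:
        if len(state) > 1:
            found |= {p for p, dot in state if dot == len(G4[p][1])}
    return found


INADEQUATE = inadequate_productions(lr0_states())
```

With this encoding `INADEQUATE` contains only index 5, the production E → a, whose completed item shares a state with the item for E → ab.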


The marked transition in figure 4.3 is nondeterministic. The ‘resolving TCG’ method requires
the replacement of E → a by ARC(E → a). This in turn requires that T → dE be ARC’d
(because it ends in E), which in turn requires that S → T and S → SaT be ARC’d. This
yields R_1G4:


Z → S⊥        A1
S⊥ → T⊥       A2⊥
Sa → Ta       A2a
S⊥ → SaT⊥     A3⊥
Sa → SaTa     A3a
T → cEd       A4
T⊥ → dE⊥      A5⊥
Ta → dEa      A5a
Ea → aa       A6a
E⊥ → a⊥       A6⊥
Ed → ad       A6d
E → ab        A7

Using PGEN on this grammar yields the deterministic machine M(R_1G4) shown in figure
4.4.

[Figure 4.4. M(R_1G4)]

□
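
The propagation step used in this example, deciding which further productions must be ARC’d once an inadequate production has been, can be sketched as below. It is a minimal illustration of the first round only (one symbol of context), assuming the rule stated earlier: a production B → β_1 A β_2 is marked when A is the left side of a marked production and β_2 can derive the empty string. The names `nullable_nonterminals` and `must_arc` are hypothetical.

```python
# Grammar G4 again; list indices 0..6 stand for productions 1..7.
G4 = [
    ("Z", ("S", "⊥")),
    ("S", ("T",)),
    ("S", ("S", "a", "T")),
    ("T", ("c", "E", "d")),
    ("T", ("d", "E")),
    ("E", ("a",)),
    ("E", ("a", "b")),
]


def nullable_nonterminals(grammar):
    """Nonterminals that can derive the empty string (none in G4)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in grammar:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def must_arc(grammar, seeds):
    """Indices of productions that must be ARC'd, starting from `seeds`
    (the productions found in inadequate items)."""
    nullable = nullable_nonterminals(grammar)
    marked = set(seeds)
    changed = True
    while changed:
        changed = False
        arced_lhs = {grammar[p][0] for p in marked}
        for q, (lhs, rhs) in enumerate(grammar):
            if q in marked:
                continue
            for i, sym in enumerate(rhs):
                # B -> beta1 A beta2 with A already ARC'd and beta2 =>* empty
                if sym in arced_lhs and all(s in nullable for s in rhs[i + 1:]):
                    marked.add(q)
                    changed = True
                    break
    return marked


TO_ARC = must_arc(G4, {5})   # seed: E -> a, the inadequate production
```

Here `TO_ARC` comes out as the indices of S → T, S → SaT, T → dE and E → a, which are exactly the productions that carry context in R_1G4; T → cEd is left alone because the d after E cannot vanish.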




4.5  Simple  Resolving  TCGs 


An unfortunate aspect of the RTCG method is that context must be added to
productions for the sole purpose of making it possible to do the closure correctly. In this
section a method for avoiding this will be presented. Unfortunately it will not be done
without a possible sacrifice of the error handling ability of the resulting parser.

Definition. The productions of a (k-1)-TCG which must be ARC’d to yield
R_kG are in one of two categories:

a) Productions which must be ARC’d because of inadequate states in
M(R_{k-1}G). Call these L-productions (because their added context
is used for an analogous purpose to lookahead strings in an LR(k)
parser: to help choose among several parsing paths).

b) Productions which must be ARC’d only so that the closure algorithm
won’t abort. Call these C-productions.

□

The context added to the C-productions is redundant after the parser has been
built. This is because M(R_{k-1}G) had no problem deciding when to reduce a
C-production, and going to M(R_kG) cannot add any new terminals to the sets that can be
seen in states where such reductions are done. Thus, one can systematically remove
the last context symbol from each C-production after M(R_kG) has been built, removing
the last transitions to reduce states. After this is done, there may be some identical
states in the resulting machine that can be merged.

Removing context from the C-productions after PGEN is finished is a useful reduction
technique, one which will be discussed in the next chapter. However, adding context
to the C-productions may cause unnecessary state splitting. While such split
states may be merged again after context is removed, such a procedure resembles the
method whereby an LR(k) parser is generated and then compatible states are merged
afterwards. If the intermediate parser is of manageable size this is satisfactory, but if it
is too big to process in a reasonable amount of time then some fallback method must
be used. It would be nice if some method could be found which avoided ARCing the
C-productions altogether.

One method is to relax the closure requirement. Instead of asking that one know for
sure  whether  or  not  an  item  should  be  in  a  given  closure,  some  decision  procedure 
could  be  used  which  perhaps  adds  more  items  than  it  needs  to  a  closure.  Of  course  it 
must  be  ensured  that  doing  so  does  not  change  the  language  parsed. 

Taking inspiration from the “simple LR” method of calculating lookahead strings,
the following is proposed. Suppose that one is trying to calculate C(Ax → β_1·Bβ_2). If
there is a production By_1y_2 → δy_1y_2 with y_1 ∈ FIRST_k(L(β_2)) (and y_2 ≠ λ) then it is unsure
whether (By_1y_2 → ·δy_1y_2) is in the closure. Rather than abort, add it to the closure iff

y_2 ∈ FOLLOW_|y_2|(By_1).

Definition. Call the grammar that results from ARCing only the
L-productions of a grammar, G, the Simple Resolving TCG (SRTCG) and
denote it by SR_kG. Call the parser generator using the above method for
closure (with FOLLOW where needed) PGEN_S, and the resulting machine
M_S(SR_kG). □

If PGEN_S is used on a grammar which was generated from a CFG by ARCing selected
productions then the closure becomes even simpler: for any production By_1y_2 → δy_1y_2
in such a grammar, it is guaranteed that y_2 ∈ FOLLOW_|y_2|(By_1), because of the
definition of ARC. Thus, FOLLOW sets need never be calculated: all uncertain items
will be added by PGEN_S.

The result of using PGEN_S is a parser which is like an SLR(k) parser in some parts
and an LR(k) parser in other parts. It is like the latter in a sufficient number of places
that the parser will work if the LR(k) parser would.

Theorem. If G is an LR(k) context-free grammar then the k-SRTCG, SR_kG,
is such that M_S(SR_kG) is deterministic.

Proof. From the previous section, M(R_kG) is deterministic. It can be
seen that M_S(SR_kG) is deterministic because it is “similar” to M(R_kG) in
those states which were inadequate in M(R_{k-1}G). Here “similar” means
that the only changes between the two are the removal of a symbol of
context from the items for C-productions. Such context removal cannot
affect the determinism of the machine because otherwise M_S(SR_{k-1}G)
would have been nondeterministic there, contradicting the fact that a
C-production is involved. □

The following example shows the use of the “simple resolving” method, as well as what
PGEN does in general for k > 1.


Example. Recall the grammar G2 from chapter 3.

S → aB        A1
B → (C)       A2
C → C,D       A3
C → D         A4
D → bE        A5
E → c         A6
E → E,c       A7

Figure 4.5 shows M(G2).

[Figure 4.5. Parser for G2]


The marked state is nondeterministic: production 5 is the problem. The 1-SRTCG,
SR_1G2, is the same as G2 except that production 5 is replaced by

ARC(D → bE) = { D) → bE),  D, → bE, }

Also, the goal production is augmented with one goalpost to give S⊥ → aB⊥.
This yields M(SR_1G2), shown in figure 4.6.

[Figure 4.6. Parser for SR_1G2]


The marked state is nondeterministic. Now it is production 5, (production 5 with
right context “,”) which is the problem. SR_2G2 is formed by replacing production 5, with

ARC(D, → bE,) = { D,b → bE,b }

And again, the goal production is augmented. The resulting parser, M(SR_2G2), was
shown in figure 3.4 except for the goalposts on the end of the goal production, which
turn out to be unnecessary for this grammar.

In this case the SRTCG method worked as well as one could wish: context was added
only to a selected production, and it turns out that the things added in the various
closures were entirely correct, so that M_S(SR_2G2) is a D2SM with the error property. □
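
The two ARC steps of this example depend on FOLLOW_1 applied first to D and then to the string “D,”. The sketch below evaluates the FOLLOW_k definition directly by enumerating sentential forms up to a length bound. It is a brute-force illustration only, not how a parser generator would compute these sets: the names `sentential_forms` and `follow_k` and the bound `max_len` are assumptions, the grammar must have no empty right sides (so that bounding the length is safe), and only contexts witnessed within the bound are found.

```python
from collections import deque

# Grammar G2; each symbol is a single character.
G2 = {
    "S": ["aB"],
    "B": ["(C)"],
    "C": ["C,D", "D"],
    "D": ["bE"],
    "E": ["c", "E,c"],
}


def sentential_forms(grammar, start, max_len):
    """All sentential forms of length <= max_len derivable from start."""
    seen = {(start,)}
    queue = deque(seen)
    while queue:
        form = queue.popleft()
        for i, sym in enumerate(form):
            for rhs in grammar.get(sym, ()):
                new = form[:i] + tuple(rhs) + form[i + 1:]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
    return seen


def follow_k(grammar, start, alpha, k, max_len=8):
    """Length-k terminal strings y with start =>* beta alpha y z, z terminal."""
    alpha = tuple(alpha)
    result = set()
    for form in sentential_forms(grammar, start, max_len):
        for i in range(len(form) - len(alpha) + 1):
            if form[i:i + len(alpha)] != alpha:
                continue
            rest = form[i + len(alpha):]
            # Everything to the right of alpha must already be terminal.
            if len(rest) >= k and all(s not in grammar for s in rest):
                result.add("".join(rest[:k]))
    return result


STEP1 = follow_k(G2, "S", "D", 1)    # contexts for ARC(D -> bE)
STEP2 = follow_k(G2, "S", "D,", 1)   # contexts for ARC(D, -> bE,)
```

`STEP1` yields the two contexts “)” and “,” used to form SR_1G2, and `STEP2` yields the single context “b” used to form SR_2G2.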

What is lost by using this simple method of closure? One might think that, analogous
to the loss in going from LR(k) to SLR(k), there would be a smaller class of grammars
for which the SRTCG method produces a deterministic parser. However the theorem
above shows that this is not the case, so the analogy between SLR and SRTCG should
not be taken too literally. The thing that is lost by using the SRTCG method is the
ability to detect errors as soon as possible. Some item (By → ·δy) might be added to a
closure when in fact y cannot appear after B is read from the particular state in
question. This is similar to the phenomenon that the SLR lookahead algorithm calculates
strings which cannot possibly be seen from some states. The result is that the error
property is lost: some or all of δy might be read when there is in fact no valid
continuation string which would yield a sentence in the language.

The  loss  of  the  error  property  would  be  unacceptable  if  it  were  not  for  the  fact  that 
there  is  only  a  very  limited  loss  when  the  grammar  was  derived  by  ARCing  a  CFG.  In 
such  a  case  only  the  terminal  context  part  of  a  production  can  be  read  erroneously. 
The  following  definition  formalizes  the  error  handling  ability  of  an  SRTCG  parser. 

Definition. A TCG parser is said to have the context-limited error property
if the only symbols that can be read erroneously are symbols in the terminal
context part of the productions. A symbol is read erroneously if there
is no possible input continuation string starting with that symbol such
that the parse will continue on to acceptance. □

If a TCG parser has the context-limited error property, this has the following
consequences:


•  As  symbols  are  read  onto  the  L-stack,  they  can  be  “tagged”  if  they  are  part  of 
a  possibly  erroneous  terminal  context. 

•  For a k-TCG, the maximum number of “tagged” symbols on the L-stack is k.
This  is  because  the  context  for  one  production  only  can  appear  on  the  top  of 
the  L-stack,  since  the  parse  is  canonical. 

•  Thus,  at  most  k  input  symbols  can  be  read  after  an  error  should  have  been 
signalled.  Furthermore,  those  symbols  are  tagged,  so  that  the  parser  can 
take  the  appropriate  care  not  to  base  semantic  actions  on  those  symbols. 
Unfortunately,  several  reductions  may  be  done  before  an  error  is  discovered. 
It  is  actually  somewhat  irrelevant  that  the  symbols  have  been  read  rather 
than  simply  looked  at,  because  they  will  all  be  back  on  the  input  when  the 
error  is  actually  detected. 

The result is that a TCG parser with the context-limited error property behaves
analogously to an LALR parser in the presence of an error: the latter may do reductions
based on looking at erroneous symbols at the front of the input. This type of error
behaviour is acceptable because the error is discovered reasonably close to the place
in the input where it should have been discovered. It is not quite as good as the error
behaviour of parsers with the true error property (as defined by Turnbull, and given in
chapter 3 of this thesis). Those parsers, which include the LR(k) and RTCG parsers, will
not read or reduce after an erroneous symbol has been seen.

A parser generated by PGEN_S does not necessarily have the context-limited error
property. However, when it is used on an SRTCG grammar (i.e., one generated by
ARCing productions of a context-free grammar) then this is the case.

Theorem. If G is an LR(k) context-free grammar then the SRTCG parser
M_S(SR_kG) has the context-limited error property.

Proof. Assume that some nonterminal B can be read from a given state
and lead to a correct parse for some input continuation, no matter how
the state was reached. If B → δ is a production of the context-free
grammar G then δ can be read from the given state and lead to acceptance,
because that δ can be reduced to B and then the assumed continuation
can occur. Thus, for any i, at least one of the right sides of the set
ARC^i(B → δ) can be read from that state (some string in FOLLOW_i(B) can
be read, and that string will appear in the ARC set). So even if some items
with impossible contexts are added to the state by PGEN_S, it will always
be true that at least a subset of the items with the same underlying
context-free part is valid (in the sense that one of the set will lead to
acceptance after the context-free part is read). This means that as long
as the parser is reading the non-added-context part of a production it is
not reading erroneous symbols. Induction shows that the parser has the
context-limited error property. □

Example. For an example of the erroneous actions allowed by the context-limited
error property, see the partial state diagram in figure 4.7. Only the states and
transitions relevant to this discussion are shown. The symbols with bars over them are those
that PGEN_S could tag as being possibly erroneous: they were “uncertain” when the
closures were being done.

[Figure 4.7. Illustrating context-limited error property]

In state 2, all of the added-context versions of D → A are in the
closure because there is nothing after the D in the item (E → e·D). In fact, it can be
seen from state 1 that only (Dbc → ·Abc) should be in the closure (but PGEN_S doesn’t
discover this). Similarly, in state 3 only (Bb → ·cb) should be in the closure but PGEN_S
adds (Bd → ·cd) also. The result of this is that all of the transitions marked ‘*’ are
reads of erroneous symbols. The worst possible action is as follows: suppose the parse
reaches a point where it is in state 3 and the front of the input is cde.... Then the parse
will continue like:

-  read the c, leading to state 4

-  read the d, reducing by Bd → cd and ending in state 3

-  read the B, reducing by A → bB and ending in state 2

-  read A then d then e, reducing by Dde → Ade and ending in state 2

-  read the D, reducing by E → eD and ending in state 1

-  discover the error because there is no d transition from state 1

The  error  should  have  been  detected  as  soon  as  the  d  was  seen  in  state  4,  but  several 
reads  and  reductions  were  done  after  that  point.  The  context-limited  error  property 
holds, however, because at no time were there more erroneous symbols on the R-stack
than  the  d  and  the  e,  and  these  were  part  of  the  terminal  context  of  the  production 
being  worked  on.  An  SLR(2)  parser  would  have  done  the  corresponding  reductions  and 
ended  in  the  same  state  with  the  front  of  input  looking  the  same.  An  LALR(2)  parser 
could  do  the  same  for  certain  inputs  too. 

4.6  A  Modified  LALR(k)  Parser 

It is possible to use the concept of TCGs to build a parser which is similar to an LALR(k)
parser,  but  which  only  needs  to  look  at  one  symbol  at  a  time  to  make  each  parsing 
decision.  The  general  idea  is  that  instead  of  looking  at  k  symbols,  those  symbols  can  be 
read  -  making  state  transitions  -  as  long  as  they  are  pushed  back  onto  the  input  after 
a  reduction. 

LALR(k) parsers were introduced by DeRemer [DeRemer 69]. They are LR(0)
machines, but with k symbols of input used at each parsing step in order to decide
which transition to make. With LALR the sets of lookahead strings labelling the
transitions are the smallest sets that still allow any sentence in the language to be parsed.
The following definition is adapted from [DeRemer and Pennello 79] (where it was given
for k = 1).

Definition. Given a state q in an LR(0) machine for a grammar G, containing
an item (A → α·),

LA_k(q, A → α) = { x | x ∈ T^k,
    q_START z ⊢* θ ᾱ q xy ⊢ θ q' A xy ⊢* q_ACCEPT,
    some y, z ∈ T*, θ ∈ V*,
    where ᾱ is the pushed form of α }

Similarly, given a state p in the LR(0) machine containing an item
(A → α·Xβ), with X ∈ N ∪ T,

LA_k(p, X) = { x | x ∈ T^k,
    q_START z ⊢* θ p X xy ⊢ θ X p' xy ⊢* q_ACCEPT,
    some y, z ∈ T*, θ ∈ V* }

□

Basically, LA_k(q, A → α) is the set of length-k strings which can appear after the
reduction of A → α in state q and lead to the parser accepting for some possible continuation.
LA_k(p, X) is the set of length-k strings which can appear after reading X in state p and
lead to the parser accepting for some possible continuation.

In an LALR(k) parser, a reduce transition is taken if the next k input symbols are in
the appropriate LA_k set; a read transition on terminal symbol a is made if the next k
symbols are in a·LA_{k-1}(p, a); nonterminal transitions are done after reductions. Thus,
a lot of comparisons are made to length-k strings. For this reason, most authors
generally dismiss LALR(k) for k > 1 as impractical. However, if one allows the pushing of
terminals back onto the input queue, as with TCGs, then the problem can be
circumvented. The method proposed is as follows:

1)  Build the LR(0) machine for the grammar.

2)  Suppose there is an inadequate apply for production A → α in state q. Replace the
item (A → α·) in q by the set { (Aa → α·a) | a ∈ LA_1(q, A → α) }. If a ∈ LA_1(q, A → α) did
not already have a transition in state q then create a read-and-apply transition
on a, applying Aa → αa. Otherwise, follow the existing a transition to state p and
add (Aa → αa·) to p. If there were other paths into p then the machine might
incorrectly apply Aa → αa after entering along those paths. This will be detected
immediately only if after doing an incorrect apply the symbol A cannot be read.
Therefore, check that for each state r such that reading a length |αa| string
from r leads to p and (Ax → ·αx) (for some x ∈ T*) is not in state r, the symbol A
cannot be read. If this is not the case then this method fails.

3)  If a state p has an inadequate item (Ax → αx·) where |x| = i, then LALR(i) was
insufficient. Replace the item (Ax → αx·) by the set
{ (Axb → αx·b) | xb ∈ LA_{i+1}(r, A → α) where r is such that reading αx from r leads to
p }. And again, find the transition for such b’s, as in (2). There would be some
limiting i after which one would give up.

Notice that the only new states added by the above procedure are those states
reached by transition on the final context character. They are almost always states
with only one reduce item in them, so that they needn’t be explicitly represented in
the machine. The exception is that if a state originally had two reduce items, and the
LA_{k-1} sets are not disjoint for k > 1, then states have to be created to read the
overlapping strings. This means that the resulting machine is about the same size as the LR(0)
machine. Also, it uses only one character at a time to make parsing decisions, and it
reads k characters of context for a production only when k - 1 characters are
insufficient to tell what action should be done. Thus, the resulting parser has very
attractive properties. There are, unfortunately, circumstances in which the above
procedure will not construct a deterministic parser despite the fact that the grammar is
LALR(k). This will be discussed later.

Definition. Call the above the TLALR(k) method. The resulting parser will
be called M_TLALR(k)(G). □

Example. Consider grammar G5:

Z → Eab⊥      A1
E → e         A2
E → EcD       A3
D → d         A4
D → Dad       A5

This grammar is LALR(2) but not LALR(1). The LALR(2) machine for G5 is shown in
figure 4.8.

Because some of the actions in the LALR(2) machine require looking at two symbols,
the encoding of the parsing actions has to be more complicated than the usual
encoding methods used for LALR(1). At best, the parser must check at each action whether
one symbol is sufficient to decide the action. Also, if there are a lot of two-symbol
strings having the same first symbol, all implying the same action, then these must all
be represented in the machine somehow. This complicates the encoding even more.

The TLALR(k) method avoids these complications. It will be seen in the next chapter
that parsing any of the TCGs presented in this chapter is just about the same as parsing
a CFG with the LALR(1) algorithm. Also, very few transitions need be added to the
LR(0) machine to get a correct parser. The TLALR(2) parser for G5 is shown in figure
4.9. It can be seen that M_TLALR(2)(G5) is deterministic.

Figure 4.8. LALR(2) machine for G5


Figure 4.9. TLALR(2) machine for G5

It can be seen that if Aaa*, is applied after reading Dab from the start state
(incorrectly), then this will be detected because too many symbols will be popped off
the stack. □


As mentioned above, the TLALR(k) method may fail to yield a deterministic parser
even though the grammar is LALR(2). This is illustrated by the following example.


Example. Replacing the second production of G5 gives G6:

Z → Eab⊥     A1
E → D        A2
E → EcD      A3
D → d        A4
D → Dad      A5


This grammar is also LALR(2), but when the TLALR(2) method is tried the machine is
nondeterministic. Figure 4.10 shows M_TLALR(2)(G6).


Figure 4.10. TLALR(2) machine for G6


The two marked states were nondeterministic in the LR(0) and LALR(1) versions of
this machine. If the lookahead set {ab, cd} were looked at in those
states then the machine would become deterministic. However, the TLALR(2) machine
follows the a from both of them to the same state, which means that the action on
reading a b in that state is unknown.

The problem with the TLALR method cannot be solved without detecting circumstances
such as the above and creating new states so that following lookahead for
one production does not merge with the identical lookahead for another production.
But this is precisely the vague sort of state splitting which was to be avoided by using
the methods of this thesis. Therefore, if the above problem occurs, one of the other
methods (such as SRTCG) should be used.

Another case for which the TLALR(k) method will not work in spite of the grammar
being LALR(k) is when the check for detecting incorrect applies reveals that such an error
would not be detected right away. This is very undesirable - and in fact it is possible
for an incorrect sentence to be accepted as valid (because popped symbols are not
checked to make sure that they match those of the apply being done). So if the check
fails, one of the other methods of this thesis should be used.

To implement the TLALR(k) method it is necessary to be able to calculate the
LALR(k) lookahead sets for certain states in the LR(0) machine. While practical
methods for doing this for k = 1 have been published [Lalonde, Lee and Horning 71],
[Anderson, Eve and Horning 73], [Aho and Johnson 74], [DeRemer and Pennello 79],
nobody seems to have addressed this problem for k > 1. Such a method will be
presented here.

There seem to be two basic methods for calculating the LALR(k) lookahead sets.
One method carries along sets of length-k terminal strings with each item as the LR(0)
machine is constructed. Whenever a closure is done, the set of possible lookahead
strings for a given item is calculated from the marked item that generated the closure
item. If a newly calculated state is found to exist already, then the new lookahead sets
must be merged into the old state - which in turn causes effects to propagate
throughout the machine. For more details, see [Aho and Johnson 74]. This method
does not seem to be too practical for k > 1, because of the large number of strings
generated, most of them never being needed. Because it is expected that the grammars
will only need k > 1 in a few isolated places, it would be desirable to have a method
which needed to calculate only a few of the LALR(k) lookahead sets.


The second basic method that has been published for calculating LALR(1) lookahead
sets depends on simulating the LR(0) machine to see what it does next after a reduction.
This method seems to have been published correctly for the first time in [DeRemer
and Pennello 79]. It is quite tricky to make sure that only proper paths are traced
through the machine after a reduction, so that the lookahead sets found correspond
exactly to those specified in the definition given above. A simulation-type method
would work well for the TLALR(k) method, because it can be used in only those places
where it is needed and with only as big a k as needed.

For k > 1 the simulation becomes more difficult because the length-k strings that can
follow a given reduction can be made up of bits and pieces, each coming from different
productions. It gets harder to see which terminal transitions can validly be taken in
which circumstances. The following method uses some grammar-based calculations to
help.

LA_k: function(q, A → α)
    L := ∅
    for each state p containing (A → •α) such that
            reading α from p leads to q:
        L := L ∪ TRANS_LA_k(p, A)
    return L
end LA_k

TRANS_LA_k: function(p, A)
    Ln := ∅
    for each item (B → β•Aγ) in state p:
        for each state q containing (B → •βAγ) such that
                reading β from q leads to p:
            for i := 0 to k - 1:
                Ln := Ln ∪ L_i(γ) · TRANS_LA_{k-i}(q, B)
        Ln := Ln ∪ H_k(γ)
    return Ln
end TRANS_LA_k

Not shown above is the calculation of H_k(γ), which is the same H used earlier in this
chapter, and the calculation of L_i(γ), which is defined as

L_i(γ) = {x | x ∈ T^i, γ ⇒* x}

(i.e., the set of strings derivable from γ of exactly length i). Such a set can be
calculated for a context-free grammar by simply trying all derivations from γ and
stopping a derivation sequence when it couldn't possibly derive a string of length i ≤ k.
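
As a rough illustration, the set of terminal strings of exactly a given length derivable from a string of grammar symbols can be computed as in the following Python sketch. It is not the enumeration of derivations described in the text: it swaps in a fixed-point iteration over bounded-length yields, which terminates even for grammars with empty or cyclic productions. The function names and the dictionary encoding of the grammar are assumptions made for this example.

```python
def concat_yields(symbols, yields, k):
    """Concatenate the bounded yields of each symbol, dropping strings longer than k."""
    result = {()}
    for sym in symbols:
        # A symbol without an entry in `yields` is treated as a terminal.
        options = yields[sym] if sym in yields else {(sym,)}
        result = {a + b for a in result for b in options if len(a) + len(b) <= k}
    return result


def bounded_yields(grammar, k):
    """Map each nonterminal to the terminal strings of length <= k it derives.

    `grammar` maps a nonterminal to a list of right-hand sides (tuples of symbols).
    """
    yields = {nt: set() for nt in grammar}
    changed = True
    while changed:  # iterate to a fixed point; the sets are finite, so this stops
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                for s in concat_yields(rhs, yields, k):
                    if s not in yields[nt]:
                        yields[nt].add(s)
                        changed = True
    return yields


def exact_length_strings(symbols, grammar, length):
    """Strings of exactly `length` terminals derivable from `symbols`."""
    yields = bounded_yields(grammar, length)
    return {s for s in concat_yields(symbols, yields, length) if len(s) == length}


# Grammar G5 from the example earlier in the chapter ("$" stands for the end marker).
G5 = {
    "Z": [("E", "a", "b", "$")],
    "E": [("e",), ("E", "c", "D")],
    "D": [("d",), ("D", "a", "d")],
}
```

For G5, `exact_length_strings(("D",), G5, 3)` contains only the string d a d, since D derives strings of odd length only.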

The calculation of the LA set for a reduce transition is obvious: simply find all states
which can get control after the reduce is done and see what can be read after reading
the left side nonterminal. In the calculation of the LA set for a nonterminal transition,
the assignment Ln := Ln ∪ H_k(γ) is also clear. It must be possible to read any string of
length k or greater derivable from γ after reading the A in production B → βAγ. The
other strings in LA(p, A) are made up of two parts: a length i string derivable from γ
and a length k-i string which can be seen after doing the reduction of B → βAγ and
reading the B.

The above procedure is a lot of work, but it need only be done in those few cases
where k > 1 is necessary. One of the more efficient LALR(1) methods, which calculate all
the LALR(1) lookahead sets at once, can be used to handle the majority of cases
where k = 1 suffices.

5. PRACTICAL CONSIDERATIONS


The last chapter presented various methods of parsing a subclass of context-free
grammars by

i) converting the CFG into a terminal context grammar
ii) developing a parser for a TCG

This chapter will discuss some of the practical aspects of using these TCG parsers. The
time and space properties of both the parsers and the parser generators will be
investigated.

The worst cases for some TCG properties are rather bad, but this is a characteristic
of most parser mechanisms. Grammars are most often used for describing programming
languages, so the example grammars used for statistics gathering are of that
type. Context-free grammars were developed for arithmetic expressions (AE), a simple
programming language (SPL), the XPL language, and the OLGA language. Some properties
of the grammars are given in the following table.


Language    Terminals    Nonterminals    CFG Productions

AE              6             4                 7
SPL            21            13                27
XPL            39            50                99
OLGA           70            53               124

The "simple programming language" is similar to that given in [Backhouse 79]. It has
expressions and sequencing control structures similar to many programming
languages, but only one "procedure". The XPL grammar is almost the same as the one
in [McKeeman, Horning and Wortman 70]: the only difference is that some of the operators
have been grouped into operator classes, to help cut the size of some of the FOLLOW
sets. The OLGA grammar is similar to the one in [Lewis 79] (for a language similar
to that described in [Abourbih et al. 78]). All of the grammars AE, SPL, XPL, and OLGA
are LALR(1). To see what effect k = 2 has on the parsers, a modification was made to
SPL to make it LALR(2). The modification was to overload the use of the comma in a
declaration list: a comma was used to separate both declaration statements and the
identifiers within those statements. This grammar is called SPL2.

A  prototype  version  of  the  PGEN$  parser  generator  has  been  implemented.  This  was 
used  to  find  parsers  for  some  RTCGs  and  SRTCGs,  to  ensure  that  the  method  is  viable 
and  for  statistics  gathering  purposes.  Also  implemented  were  some  table  reduction 
methods  to  be  discussed  later  in  the  chapter. 

5.1  Parser  Time 

It  is  well  known  that  LR(k)  parsers  (and  derivatives  such  as  SLR,  LALR)  will  accept  a 
sentence  in  time  linearly  proportional  to  the  length  of  that  sentence.  This  is  because 
the  parser  passes  over  the  input  only  once  and  there  is  a  constant  bound  on  the 
number  of  actions  that  will  be  done  for  each  input  symbol. 

The TCG parsers discussed in this thesis have the same property of accepting a sentence
in linear time. This can be seen from the correspondence between full-TCGs and
LR(k) grammars. It was shown in the last chapter that for an LR(k) grammar G and the
corresponding full-TCG G', the machines M_LR(k)(G) and M(G') are equivalent in the
sense that they will perform essentially the same actions when parsing a sentence in
the language. The only difference was that the former might look at a context string
when the latter reads the same context string and then pushes it back onto the input
queue.

There  is  little  difference  between  looking  and  reading.  In  both  cases  the  parser 
needs  a  holding  area  of  size  k  where  it  gets  the  next  input  symbol  from.  Also,  in  both 
cases  the  context  string  is  read  from  the  holding  area  once  and  then  it  is  reused  later, 
after  the  reduce.  The  implementation  can  be  almost  the  same:  in  both  cases  the  input 
to  be  reused  can  still  be  in  the  data  structure  used  to  represent  the  input  string 
(although  technically  in  “free  space”  because  the  queue  pointer  has  passed  it),  and 
getting  ready  for  the  reuse  is  simply  a  matter  of  adjusting  the  queue  pointer.  The  TCG 
parsers  do  require  moving  the  terminal  context  onto  the  L-stack,  but  that  context 
would  have  had  to  have  been  examined  anyway  so  that  there  is  only  a  little  increase  in 
the  amount  of  time  required  —  certainly  bounded  by  a  constant  factor. 
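
The similarity between looking and reading can be made concrete with a small Python sketch of an input buffer. The class and method names are hypothetical and the thesis does not prescribe this data structure; the point is only that "pushing back" context amounts to moving a queue pointer, because the consumed symbols are still in the buffer.

```python
class InputQueue:
    """Input buffer whose consumed symbols stay in place so they can be reused."""

    def __init__(self, symbols):
        self._symbols = list(symbols)
        self._pos = 0  # queue pointer: index of the next unread symbol

    def peek(self, k):
        """LR(k)-style lookahead: inspect up to k symbols without consuming them."""
        return tuple(self._symbols[self._pos:self._pos + k])

    def read(self):
        """TCG-style read: consume one symbol and advance the queue pointer."""
        symbol = self._symbols[self._pos]
        self._pos += 1
        return symbol

    def unread(self, count):
        """Make the last `count` symbols available again by moving the pointer back."""
        if count > self._pos:
            raise ValueError("cannot unread more symbols than were read")
        self._pos -= count
```

In this sketch a parser that read two context symbols before a reduction would call `unread(2)` afterwards; no symbols are copied in either direction.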

For k = 1 there would probably be little noticeable difference in the time required by
an LR(k) parser for a grammar and a TCG parser for the corresponding full-TCG. For
k > 1 nobody implements LR(k) parsers because the description of them implies that at
every step the next k symbols must be examined. This is certainly more time consuming
than the TCG parser which has the effect of looking ahead k symbols only for
reduces. Thus there is little or no time difference between parsing with an LR(k)
parser or the corresponding full-TCG parser.

The  other  parsers  described  in  this  thesis  (for  RTCGs  and  SRTCGs)  have  at  least  the 
same  time  properties  as  the  corresponding  full-TCG  parsers.  Because  there  are  some 
terminal  contexts  which  needn’t  be  read  twice  in  them  they  are  usually  better.  This  is 
clear  because  those  other  parsers  are  the  same  as  the  full-TCG  parser  except  for  state 
merging  and  the  removal  of  some  terminal  contexts. 

There is still the question of how much worse they are than, say, an LALR parser. The
RTCG and SRTCG parsers would operate slightly slower because they have to read context
where the LALR parser doesn't have to. The SRTCG parser would be almost as fast
as the LALR parser because the SRTCG grammar was formed by adding context only to
those productions which needed lookahead - so that the LALR parser would be looking
at that context in those states where the LR(0) machine is inadequate because of
reductions of those productions. The only difference would be that there might be
some productions which are inadequately reduced in some states of the LR(0) machine,
but are all right in other states. The LALR machine only has to look at context in the
inadequate states, whereas the SRTCG has to read the context for all reductions of the
given production.

The  RTCG  parser  would  seem  to  be  slightly  worse  because  it  has  productions  with 
terminal  context  that  the  LALR  machine  never  looks  at.  However  a  table  reduction 
technique  called  context  removal  (to  be  described  later)  will  get  rid  of  such  superfluous 
context,  thus  making  it  the  same  as  an  SRTCG  parser  from  the  time  point  of  view. 

As with most parsing methods, there is a time-space tradeoff involved in the
implementation decisions for TCG parsers. To save space one might be willing to use a table
compaction technique with which it takes longer to parse. Because it will be seen that
there is somewhat of a size problem for TCG parsers, it might turn out that a tighter
compaction method is chosen for them than would be used for an LALR parser. This
effect would cause the TCG parsers to be slower.


All  things  considered,  the  parsing  speed  of  the  TCG  parsers  described  in  this  thesis 
need  not  cause  concern.  For  example,  compared  with  the  time  that  a  compiler  spends 
doing  semantic  actions  after  a  reduction,  the  time  actually  spent  doing  queue  and 
stack  manipulation  seems  insignificant. 

5.2  Parser  Space 

The  TCG  parsers  do  require  more  space  than  LALR  parsers.  It  is  to  be  hoped  that  the 
increase  can  be  minimized  by  table  reduction  techniques.  Then  a  small  size  increase 
may  be  worth  the  additional  power  offered  by  TCGs  over  CFGs. 

There  are  two  reasons  for  the  size  increase:  the  ARCing  process  increases  the 
number  of  productions  in  the  grammar,  and  more  states  are  generated.  Though  there 
is  some  relation  between  the  two,  they  will  be  discussed  each  in  turn. 

5.2.1 Increase in number of productions. A TCG is formed from a CFG by ARCing a
subset of the context-free productions. If there are many elements of that subset,
and/or the FOLLOW sets of the left-hand sides of those productions are large, then the
resulting grammar could have many more productions than the CFG. This can cause a
size increase in the parser (though not necessarily, as will be seen below).

The worst imaginable thing that could happen in going from a CFG to a k-TCG is that
every production would have to be replaced by ARC_k of it, and that the FOLLOW_k sets of
every nonterminal contain every terminal. A grammar that approaches this in the
limit is G7:

S → A1 A2 ... An C^(k-1) D
S → B1 B2 ... Bn C^(k-1) E
S → S A1 A2 ... An C^(k-1) E
S → S B1 B2 ... Bn C^(k-1) D
Ai → λ     1 ≤ i ≤ n
Bi → λ     1 ≤ i ≤ n
C → ai     1 ≤ i ≤ m
D → ai     1 ≤ i ≤ m/2
E → ai     m/2 < i ≤ m

This grammar has been arranged so that each of the Ai → ... and Bi → ... productions
needs k characters of context to get a deterministic TCG parser, and each of those
productions must be followed by all possible m^k strings of terminals. The original
grammar has 2n + 2m + 4 productions, and the R_kTCG has 2nm^k + 2m + 2 productions.
For large n and medium m, this represents an approximate m^k-fold increase. Thus the
worst possible case of getting |T|^k times as many productions is actually possible.
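
To get a feel for the numbers, the two production counts quoted above can be evaluated directly. The Python sketch below only restates those two formulas as they appear in the text; the function name and the sample values of n, m and k are chosen for illustration.

```python
def worst_case_counts(n, m, k):
    """Return (CFG productions, TCG productions, growth ratio) for the worst-case grammar."""
    original = 2 * n + 2 * m + 4        # count quoted for the original grammar
    tcg = 2 * n * m**k + 2 * m + 2      # count quoted for the k-context TCG
    return original, tcg, tcg / original


original, tcg, ratio = worst_case_counts(n=100, m=4, k=2)
```

With n = 100, m = 4 and k = 2 this gives 212 productions against 3210, a ratio of about 15, close to m^k = 16; the ratio approaches m^k as n grows.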

The grammars that people write for programming languages do not seem to have
such unreasonable size increases. The reason is that there is a fair amount of redundancy
in programs, so that the set of terminal strings that can follow a given nonterminal
is usually a small subset of T^k. The instances in the grammar where k > 1 is necessary
should be very local, and even the places where k = 1 is necessary do not represent
all of the productions. The following table gives some figures on the grammar size
increases for the grammars mentioned earlier. All of the grammars required a maximum
context length of one, except SPL2 which needed a maximum context length of
two.


                   Grammar Size
Language     CFG    RTCG    SRTCG    L-prods    C-prods

AE             7      13       13          2          0
SPL           27      89       77          7          4
XPL           99     333      199         22         24
OLGA         124     546      294         24         20
SPL2          29      86       77          8          3

The average increase seems to be about three to four times for the RTCGs and two to
three times for the SRTCGs. While this isn't nearly as bad as the |T| times increase
that is possible, it is still a fairly substantial one. The grammar SPL2 didn't come
anywhere near the possible increase factor of 70^2 - in fact, it increased slightly less than
SPL. The reason for this is that the declarations part of the languages was slightly
different, with the result that SPL had one more C-production than SPL2. The instance
where k = 2 was needed did not cause a large increase in the number of productions
because there were very few strings of length equal to 2 that could follow the spot
requiring k = 2.

One reason why the number of productions might matter is that there are some
tables needed for parsing which are indexed by production number. To do reductions
one must have the index of the left-hand side nonterminal and the number of symbols
on the right-hand side, for each production. (The terminal context need not be kept
around because it is not pushed back onto the input queue: it is recovered by resetting
the input queue pointer.) Thus, about two bytes more space are needed for each
additional production.

It is possible to avoid this need for extra space. Because of the suggested method of
doing reductions, it does not matter which of a group of productions is being applied if
they are ARC'd versions of the same context-free production and if they have the same
length of terminal context. This means that the productions can be renumbered so
that such groups all have the same number. If k = 1 the original CFG numbering can be
obtained. If k > 1, there may be some original CFG productions which exist with several
different lengths of terminal context in the TCG. These will yield several groups instead
of just one, but the resulting grammar size will still be almost the same as the original
one. Call the transformation just described the production combining reduction.

The  production  combining  reduction  should  almost  always  be  done.  There  is  an 
instance  where  it  might  be  advantageous  not  to  do  so:  if  space  is  not  very  important 
and  one  wishes  to  do  different  things  in  the  semantic  action  for  a  reduction  depending 
on  the  context  that  the  production  appears  in,  then  the  TCG  production  numbering 
could  provide  this  information  automatically. 
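
A minimal sketch of the production combining reduction, assuming each TCG production is recorded as a pair (number of the context-free production it came from, terminal context), might look as follows in Python. The representation and the function name are illustrative assumptions, not taken from the prototype generator.

```python
def combine_productions(tcg_productions):
    """Renumber TCG productions so that interchangeable ones share a number.

    `tcg_productions` is a list of (cfg_production_number, context) pairs, where
    `context` is a tuple of terminals. Two productions share a number when they
    come from the same context-free production and have the same context length.
    Returns the new numbers (parallel to the input) and the group table.
    """
    group_number = {}
    new_numbers = []
    for cfg_number, context in tcg_productions:
        key = (cfg_number, len(context))
        if key not in group_number:
            group_number[key] = len(group_number)  # next unused group number
        new_numbers.append(group_number[key])
    return new_numbers, group_number


# Three contexts for CFG production 3 (two of length one, one of length two)
# and one context-free production 4.
numbers, groups = combine_productions(
    [(3, ("c",)), (3, ("a", "b")), (3, ("d",)), (4, ())]
)
```

Here the two length-one versions of production 3 receive the same number, while the length-two version forms its own group, matching the k > 1 case described above.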

5.2.2  Increase  in  number  of  states.  There  are  more  states  in  a  TCG  parser  than  there 
are  in  the  corresponding  LALR  parser  (if  the  latter  exists).  There  are  two  reasons  for 
this.  First,  there  are  states  to  read  the  context  strings;  second,  there  is  state  splitting. 

The extra states to read the context strings do not cause any trouble. If the context
removal reduction is done then only the last symbol of context causes a new state,
because the other symbols would have been read in states that had to be there for
other reasons too. The last symbol of context can be read by a read-and-reduce transition,
represented in the goto table by the production number corresponding to the
reduction to be done. Usually no space is lost compared to the LALR machine because
the latter probably had a lookahead transition on the same symbol. The only places
where the LALR parser might get away without reading a context symbol that the TCG
parser must read are those places where a production is adequately reduced in the
LR(0) machine - and the same production is inadequately reduced somewhere else in
the LR(0) machine. Such instances seem to be rare.

The other cause of increase in the number of states is the state splitting caused by
the different contexts on the same context-free production. Of course this state splitting
is sometimes one of the attractive features of the TCG method: since it happens
only in localized parts of the machine it gives an automatic method of getting the full
power of LR(k) where it is needed - not everywhere. There are some cases, though,
where there is more state splitting than one would hope for.

Consider G8:

S → A ai     1 ≤ i ≤ m
A → bb
A → b

The LR(0) machine has a state like

(A → b•b) (A → b•)

which is nondeterministic. The LALR(1) machine resolves that by doing the reduction if
the next symbol is one of the ai. The RTCG, R1G8, contains the productions

A ai → b ai     1 ≤ i ≤ m

So M(R1G8) contains the m states
(A → b•b) (A ai → b•ai)

Notice,  however,  that  approximately  the  same  amount  of  information  is  needed  by  the 
parser  in  either  case  if  the  parsers  are  to  have  the  error  property:  actions  are  needed 
for  m  context  characters.  Clever  table  reduction  methods  can  make  use  of  this  to 
greatly  diminish  the  size  differences. 

It is unfortunate that the sort of situation represented by grammar G8 occurs fairly
often in programming language grammars. In particular, if A is "expression", there
are usually many contexts in a programming language where an expression can occur.
The TCG parsers replicate the entire expression subparser for each such context.

The following table gives some idea of the increase in the raw number of states for
various types of parsers in practice. In all cases the parsers do read-and-reduce transitions
on the final symbol of the production wherever that can be done. The "number
of states" excludes the target states of such transitions, since they needn't be
explicitly represented in the parsing tables. The notation "RTCGc" means an RTCG
parser with the context removal reduction applied to the machine afterwards. The context
removal reduction replaces the C-productions by the context-free parts of them.
Then the transitions for the context part of those productions are removed from the
machine, and any inaccessible states are dropped. As was mentioned in the last
chapter, this does not affect the determinism of the parser. The prototype parser generator
was not programmed to handle large grammars, so that it wouldn't run on the
OLGA TCGs or on the XPL RTCG (though it got almost all the way through the latter -
enough to estimate some figures).


Language    Parser     Number of States

AE          LALR(1)           8
            RTCG             11

SPL         LALR(1)          32
            SRTCG            47
            RTCG             95
            RTCGc            83

XPL         LALR(1)          92
            RTCG           ~250
            SRTCG           117

SPL2        SRTCG            47
            RTCG            100

As mentioned above, table reduction techniques can help reduce the size penalty of
TCG parsers. The fact that state splitting seems to cause the problem leads naturally
to the idea of merging compatible states to reduce table size, because instances such
as the split caused in G8 are solved by this.

Joliat introduced a table reduction method based on merging compatible states
[Joliat 73]. This has been implemented for application to the machine tables output
by the prototype parser generator. It seems to do a good job at reducing the
table sizes, and was used on all machines whose sizes are reported later.

The Joliat reduction proceeds as follows:

1) The original machine is split into two matrices: M', containing the next state for
a given (state, symbol) pair; and E, the error matrix, which is a bit matrix
containing a 1 wherever a given (state, symbol) pair indicates an error. Thus, M'
contains "don't care" entries where the original matrix contained error entries.

2) (Optional) The states may be reassigned to reduce the bit width needed in the
table, and to try to increase commonality in the matrix (see below). If this step
is done, the resulting machine M'' is called the reassigned state machine, and
the state assignment reduction has been done. It is an optional step because it
introduces another level of indirection into the parsing algorithm.

3) An approximate (but almost optimal, in practice) method of minimizing incompletely
specified finite state machines is used to combine compatible states of M'
(or M''). For this purpose, both the rows and columns can be regarded as
"states". The minimization can be done Row-then-Column or Column-then-Row.

4) (Optional) Semanticless chain rules may be eliminated.

In order to maintain the full error detection capabilities of the original parser, the
original state numbers must still be accessible during parsing (because the bit matrix E is
indexed by them). This means that the state numbers in the table are regarded as outputs
for compatibility purposes. Hence, Joliat's technique ends up overlaying states
which differ in that they are often specified for disjoint sets of input symbols. However
it also overlays states which are the same except for a small set of input symbols, with
each state specified for a part of that small set. This takes care of the problem with
the expression part of a programming language grammar.
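
The splitting and overlaying steps can be illustrated with the following Python sketch. It is not Joliat's algorithm: it uses a simple greedy pass over the rows in place of the near-optimal minimization of incompletely specified machines, handles rows only, and encodes error entries as `None`. All names here are assumptions made for the example.

```python
ERROR = None  # hypothetical marker for an error entry in the original table


def split_machine(table):
    """Split a table (list of rows) into M' and the error bit matrix E.

    In M' the former error entries act as "don't care" entries (still None);
    E records with a 1 where the original table had an error.
    """
    m_prime = [list(row) for row in table]
    error = [[1 if entry is ERROR else 0 for entry in row] for row in table]
    return m_prime, error


def compatible(row_a, row_b):
    """Two rows can be overlaid if they agree wherever both are specified."""
    return all(a is None or b is None or a == b for a, b in zip(row_a, row_b))


def merge_rows(m_prime):
    """Greedily overlay compatible rows; row_map[i] is the merged row for state i."""
    merged, row_map = [], []
    for row in m_prime:
        for index, candidate in enumerate(merged):
            if compatible(candidate, row):
                merged[index] = [a if a is not None else b
                                 for a, b in zip(candidate, row)]
                row_map.append(index)
                break
        else:  # no existing merged row is compatible: start a new one
            merged.append(list(row))
            row_map.append(len(merged) - 1)
    return merged, row_map


table = [[1, ERROR, 2], [ERROR, 3, 2], [4, ERROR, ERROR]]
m_prime, error = split_machine(table)
merged, row_map = merge_rows(m_prime)
```

Because E is still indexed by the original state numbers, a parser built this way would consult `error[state][symbol]` before using `merged[row_map[state]][symbol]`, so overlaying rows loses no error detection.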

The RTCG parsers have better error detection than the LALR parsers, because the
latter may do some reductions when the next input symbol is wrong. Therefore, it is
not quite fair to compare the sizes of the two parsers without taking account of this
fact. If one were willing to accept the lesser error detection capabilities of the LALR
parser, then Joliat's reduction technique could be modified by allowing states to be
completely merged (the error rows too) if they differ only in that one has an error bit
where the other has a reduction.

The  “state  assignment”  reduction  is  a  method  whereby  the  entries  in  the  goto  table 
are  replaced  by  very  small  integers,  with  a  translation  method  to  get  the  original  ones 
back.  It  is  really  a  “state-and-production  assignment”,  but  Joliat’s  term  will  be  used  in 
this  thesis.  The  method  is  as  follows: 

1) The states are renumbered so that states which are accessed by the same symbol
are grouped together.

2)  For  each  column,  there  are  only  a  few  distinct  transitions  to  non-reduce  states, 
and  these  all  have  the  same  state-access  symbol  (the  column  heading).  A  base 
table  is  kept  with  one  entry  per  column,  giving  the  number  of  the  first  state  with 
that  state-access  symbol.  Thus,  in  the  table  only  a  few  bits  are  needed  to  hold 
the  displacement  from  that  first  state  number. 

3)  For  each  row,  there  are  only  a  few  distinct  read-and-reduce  numbers.  These  are 
assigned  sequential  numbers  (starting  from  0  for  each  row).  There  is  a  reduce 
table  which  contains  lists  of  read-and-reduce  numbers.  There  are  usually  many 
rows  which  have  the  same  read-and-reduce  numbers  in  them  (or  subsets  of  some 
particular  set),  so  the  reduce  table  can  be  optimized  to  overlap  common  lists  of 
read-and-reduce  numbers.  There  is  a  pointer-to-reduce  table  which  contains,  for 
each  row,  the  start  in  the  reduce  table  of  the  list  for  that  row.  Again,  only  a  few 
bits  are  needed  to  hold  the  displacement  into  this  list. 

The above procedure certainly reduces the bit width required in the goto table, probably
more than necessary (since for speed purposes it is probably not wise to pack the
entries tighter than one per addressable storage unit, a byte, say). Perhaps a more
important benefit is that there are now very few distinct entries in the table (mostly
0’s, 1’s and 2’s) so that there is a better chance that two given rows or columns can be
overlaid.
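
Steps 1 and 2 of the state assignment can be illustrated with a short Python sketch. It covers only the non-reduce transitions; step 3 (the read-and-reduce lists) is left out. The function names, the dictionary representation and the six-state machine are assumptions made for illustration and are not taken from Joliat’s implementation or from PGEN.

```python
from collections import defaultdict

def assign_states(access_symbol, goto):
    """Renumber states by access symbol and store goto entries as displacements.

    access_symbol: old state -> symbol that accesses it ("" for the start state)
    goto: (old state, symbol) -> old target state (non-reduce transitions only)
    """
    # Step 1: group states that are accessed by the same symbol.
    groups = defaultdict(list)
    for state in sorted(access_symbol):
        groups[access_symbol[state]].append(state)

    # Number the groups consecutively; base[symbol] is the first state of a group.
    new_number, base = {}, {}
    for symbol in sorted(groups):
        base[symbol] = len(new_number)
        for state in groups[symbol]:
            new_number[state] = len(new_number)

    # Step 2: each table entry keeps only the displacement from the column base.
    table = {
        (new_number[state], symbol): new_number[target] - base[symbol]
        for (state, symbol), target in goto.items()
    }
    return new_number, base, table

def lookup(base, table, state, symbol):
    """Translate a small table entry back into a full state number."""
    return base[symbol] + table[(state, symbol)]

# Hypothetical six-state machine, used only to exercise the functions.
access_symbol = {0: "", 1: "x", 2: "E", 3: "x", 4: "E", 5: "+"}
goto = {(0, "x"): 1, (0, "E"): 2, (2, "+"): 5, (5, "x"): 3, (5, "E"): 4}
new_number, base, table = assign_states(access_symbol, goto)
```

In this toy machine every stored entry is 0 or 1, although the state numbers themselves run from 0 to 5; the base table restores the full number when an entry is looked up.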

The  Joliat  method  of  table  reduction  seems  to  do  quite  well  at  reducing  the  goto 
table  sizes  of  TCG  parsers.  The  following  table  gives  some  representative  table  sizes. 
All of the auxiliary arrays required are included in the space requirements. The
“parser types” have characters appended to them giving the reduction methods used:
‘p’ for production combining, ‘c’ for context removal, and ‘s’ for state reassignment.

Language  Parser     Unreduced Size  Reduced Size

AE        LALRs                  94            88
          RTCGs                 136           119
          RTCGps                124           101

SPL       LALR                 1142           544
          LALRs                1142           422
          SRTCG                1742           734
          SRTCGp               1652           677
          SRTCGps              1652           537
          RTCG                 3408          1261
          RTCGp                3284          1237
          RTCGpc               2875          1093
          RTCGps               3284           975
          RTCGpcs              2875           893

XPL       LALR                 8386          2324
          LALRs                8306          1959
          SRTCG               21224          4507
          SRTCGp              10611          2641
          SRTCGps             10611          2255

SPL2      SRTCG                1916           851
          SRTCGp               1826           737
          SRTCGps              1826           628
          RTCG                 3572          1334
          RTCGp                3458          1220
          RTCGps               3458          1079

The  sizes  are  in  bytes.  Almost  all  entries  could  fit  into  a  single  byte;  the  exception  was 
the  SRTCG  parser  for  XPL  (without  production  renumbering),  which  explains  why  those 
entries are so large.

It  can  be  seen  that  the  best  RTCG  parsers  for  moderately  large  languages  are  about 
twice  the  size  of  the  best  LALR  parsers  (using  the  reduction  methods  described 
herein).  Recall  that  this  buys  enhanced  error  detection.  The  SRTCG  parsers  are  only 
about  15%  larger  than  the  corresponding  LALR  parsers.  A  lot  of  the  size  increase  is  due 
to  the  fact  that  some  tables  depend  on  the  original  number  of  states.  For  instance, 
even  though  the  error  matrix  is  a  bit  matrix,  it  starts  to  require  a  sizeable  amount  of 
space  for  the  larger  grammars. 
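
The ratios quoted above can be checked against the table with a few lines of Python. The reduced sizes are copied from the table; the dictionary layout and the function name `ratio` are assumptions made for this illustration.

```python
# Reduced sizes in bytes for the smallest parser of each kind, copied from the table.
reduced = {
    "SPL": {"LALRs": 422, "SRTCGps": 537, "RTCGpcs": 893},
    "XPL": {"LALRs": 1959, "SRTCGps": 2255},
}

def ratio(language, parser, baseline="LALRs"):
    """Size of `parser` relative to the baseline parser for one language."""
    sizes = reduced[language]
    return sizes[parser] / sizes[baseline]

for language, sizes in reduced.items():
    for parser in sizes:
        if parser != "LALRs":
            print(f"{language:4} {parser:8} {ratio(language, parser):.2f} times the LALRs size")
```

On these figures the best RTCG parser for SPL is a little over twice the size of the best LALR parser, and the 15% figure for SRTCG corresponds to XPL; for SPL the SRTCG ratio is somewhat higher.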

Size increases of this sort are not so bad when it is realized that the parsing tables
form only a small portion of the total compiler. It is certainly feasible to use such
tables, and the increased flexibility for the grammar writer may be worth it.

5.3 Parser Generator Time and Space


The time and space properties of the parser generator are much less important than
those of the parser, because it runs much more rarely. What matters is whether it is
implementable in such a way that it runs at all (and preferably, doesn’t take all day).
In this section, implementation methods will be discussed for PGEN so that it can handle
reasonably sized TCGs of the type likely to be used to describe a programming
language.

By far the greater problem is space. The prototype implementation was not able to
handle the OLGA grammars because it ran out of space. This was the result of two
things: (a) the method used for keeping productions and items was naive, entailing a
lot of repetition of similar strings; and (b) the PDP-11 that it ran on was limited to a 64
kbyte data area, and the program had no provision for using secondary storage to handle
the overflow. Many computer architectures, especially the newer ones, do not have
such an unrealistically small address space. The prototype parser generator could
have handled reasonably sized grammars on a machine with a larger address space,
because time seemed to be no problem for it. However, there are some methods whereby
a smarter implementation might run on a machine as small as the PDP-11.

One of the things that took up a lot of room in the prototype parser generator was
the space for the productions. It represented each production explicitly. A big space
saving would be gained by keeping only the context-free part of the productions explicitly,
together with a set of context strings to be added to that context-free part to get
the productions of the TCG.
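
The factored representation can be sketched in Python as follows. The class name `FactoredProduction`, its fields and the sample production are hypothetical; the sketch only illustrates storing one context-free part together with a set of terminal context strings, and expanding it to the TCG productions on demand.

```python
from dataclasses import dataclass, field

@dataclass
class FactoredProduction:
    lhs: str                                     # left-side nonterminal
    rhs: tuple                                   # context-free right side
    contexts: set = field(default_factory=set)   # each context is a tuple of terminals

    def expand(self):
        """Yield every TCG production (left side, right side) this entry stands for."""
        for ctx in sorted(self.contexts):
            # The same terminal string is appended to both sides.
            yield (self.lhs,) + ctx, self.rhs + ctx

# One stored entry standing for two TCG productions:  A x -> b C x  and  A y z -> b C y z
p = FactoredProduction("A", ("b", "C"), {("x",), ("y", "z")})
productions = list(p.expand())
```

The context-free part `A -> b C` is stored once however many contexts are attached to it.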

The states are represented in the parser generator by the sets of kernel items only.
Still, there are often many items which represent the same context-free marked production,
but with different terminal context parts. These too are represented explicitly
in the prototype, and they too may be replaced by a marked context-free production
and a set of terminal context strings (some escape is needed for those items where
part of the terminal context is marked).

An efficient method is needed for storing a set of context strings. For a typical
grammar, not many of the $|T|^k$ possible context strings are used, especially since k>1
is needed in only a very few spots. Thus, one might keep a small (256 entry?) hash
table of context strings used, and represent each context string by the index in that
table. Then a set of context strings could be represented as a few (8?) words wherein
each bit indicates whether or not the corresponding string is in the set.
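
A minimal Python sketch of this scheme follows. The class `ContextTable` and its methods are hypothetical names; a Python dictionary stands in for the hash table and a Python integer stands in for the few words of bits.

```python
class ContextTable:
    """Maps each context string in use to a small index and builds bit sets."""

    def __init__(self, capacity=256):
        self.capacity = capacity
        self.index = {}     # context string -> small integer
        self.strings = []   # small integer -> context string

    def intern(self, context):
        """Return the index of `context`, entering it in the table if it is new."""
        if context not in self.index:
            if len(self.strings) >= self.capacity:
                raise OverflowError("context table full")
            self.index[context] = len(self.strings)
            self.strings.append(context)
        return self.index[context]

    def make_set(self, contexts):
        """Bit i of the result is set when context string number i is in the set."""
        bits = 0
        for context in contexts:
            bits |= 1 << self.intern(context)
        return bits

    def members(self, bits):
        """List the context strings whose bits are set."""
        return [s for i, s in enumerate(self.strings) if bits >> i & 1]

table = ContextTable()
a = table.make_set([(";",), (";", "end")])
b = table.make_set([(";", "end"), ("x",)])
common = table.members(a & b)
```

Set union and intersection become single bitwise operations on the encoded sets, as the intersection `a & b` shows.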

A further saving is possible. For a typical grammar, not many of the $2^{|T|^k}$ possible
sets of context strings are used; mostly, FOLLOW sets of the nonterminals and a few
subsets of each of those are the sorts of sets that will tend to be needed in the parser
generator. Thus, one can represent a set of context strings by a pointer to the few
words of bits just described, which are kept in a separate area. Then the same sets can
be pointed at from the many places in the parser generator where they are needed.
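
Sharing the sets can be sketched in the same style. The class `SetPool` is a hypothetical name; the list plays the role of the separate area and the returned integer plays the role of the pointer.

```python
class SetPool:
    """Stores each distinct bit set once and hands out small handles to it."""

    def __init__(self):
        self.sets = []      # the separate area: one entry per distinct set
        self.handle = {}    # bit set -> position in self.sets

    def intern(self, bits):
        """Return the handle of `bits`, storing the set only if it is new."""
        if bits not in self.handle:
            self.handle[bits] = len(self.sets)
            self.sets.append(bits)
        return self.handle[bits]

pool = SetPool()
h1 = pool.intern(0b0110)   # e.g. the FOLLOW set of some nonterminal
h2 = pool.intern(0b0001)
h3 = pool.intern(0b0110)   # the same set needed at another place
```

Here `h1` and `h3` are the same handle, so the second use of the set costs no extra storage in the pool.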

It  seems  certain  that  such  methods  could  be  used  to  create  a  parser  generator 
which  would  run  on  the  PDP-11  and  handle  any  1-TCG  for  a  programming  language,  and 
2-  and  3-  TCGs  if  the  multi-symbol  context  is  not  needed  in  very  many  places.  If  a  disk 
were  used  to  hold  the  kernels  of  those  states  not  currently  being  looked  at,  then  quite 
large  RTCGs  could  be  handled. 

So, it seems that the intermediate size of the TCG machines should not be a barrier
to their use. The previous section has shown that the final parser sizes are not horrendous
either, so the TCG method is a feasible way of getting the full power of LR(k)
grammars, for “reasonable” grammars.

6. EVALUATION AND CONCLUSIONS


This chapter will summarize what has been done, comparing the methods proposed
with previous methods. In this thesis several different methods have been proposed for
generating parsers for context-free languages, all of them based on parsing a related
terminal context grammar. The existing method which might be used instead is the
use of an LALR(1) parser generator. This requires modifying the grammar into a form
that is LALR(1). It will be used for comparison with the TCG parsers.

The TCG parsers have the disadvantage relative to the LALR(1) parsers that they are
harder to generate and take up more space. However, it has been shown in chapter 5
that for practical examples the use of the TCG parsers is feasible, and the space
penalty is not too large. If the advantages of using TCG parsers are great enough, these
disadvantages may be outweighed.

The advantages of the TCG methods are that they allow the use of context-free grammars
which cannot be handled by the LALR(1) method, and it is possible to get better
error handling. Grammars which are LR(k) can be handled, with k>1 allowed. How
useful this is depends on how often one would like to write a grammar which requires
k>1 or the full power of LR, and how difficult and/or inconvenient it is to modify the
desired grammar into an LALR(1) grammar. This issue will be discussed at some
length in the next section.

6.1 Is k>1 needed?

When this thesis was being written, several groups were developing programming
language grammars. One was for a standardized language (ISO Pascal) and another was
for a new language being designed (call it ‘Wiz’). These grammars were examined
before their authors had a chance to check whether they were LALR(1). This presented
the opportunity to see whether ‘natural’ grammars require k>1. The examples
presented in this section were adapted from the grammars examined by extracting
relevant subparts and using terminals for other parts which did not matter. Some of
the fragments then inspired other grammars to illustrate other points.

The first grammar comes from a description of records in Pascal:

1: R -> ‘record’ Field_list ‘;’ ‘end’

2: Field_list -> Fixed_part
3:             | Fixed_part ‘;’ Var_part
4:             | Var_part

5: Fixed_part -> ‘x’
6:             | Fixed_part ‘;’ ‘x’

7: Var_part -> ‘case’ ‘y’ ‘of’ ‘z’

The LR(0) machine for this grammar has a state:

Field_list -> Fixed_part •
Field_list -> Fixed_part • ‘;’ Var_part

Because of production 1, a ‘;’ can follow a Field_list in this state, so k = 1 is insufficient.
With k = 2, the actions can be decided according to whether the next input tokens are
‘;’ ‘end’ or ‘;’ ‘case’.
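
The decision in this state can be written out as a small Python table, taking the context token that follows a field list in production 1 to be the semicolon. The table and the function name `actions_for` are illustrative assumptions; the entry for ‘;’ ‘x’ is not listed in the text and is added here on the assumption that production 6 contributes a shift in the same state.

```python
# Action chosen in the state, indexed by the next two input tokens (k = 2).
ACTIONS_K2 = {
    (";", "end"): "reduce by production 2",
    (";", "case"): "shift ';'",
    (";", "x"): "shift ';'",   # assumed: from production 6
}

def actions_for(prefix):
    """All actions compatible with the lookahead tokens seen so far."""
    prefix = tuple(prefix)
    return {act for look, act in ACTIONS_K2.items() if look[:len(prefix)] == prefix}

one_token = actions_for([";"])            # what k = 1 can see
two_tokens = actions_for([";", "end"])    # what k = 2 can see
```

With one token of lookahead both a shift and a reduce remain possible; with two tokens exactly one action is left.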


To make this grammar LALR(1), an experienced grammar writer would probably
create something like:

1: R -> ‘record’ Fixed_list_semi ‘end’

2: Fixed_list_semi -> Fixed_part ‘;’
3:                  | Fixed_part ‘;’ Var_part ‘;’
4:                  | Var_part ‘;’

5,6,7: (as before)

The technique that has been used can be called the ‘new nonterminal fixup’. It creates
a new nonterminal which generates the language of the old nonterminal followed by
k-1 context tokens. This is put in place of the old nonterminal where necessary. What
follows that old nonterminal must be modified so that it no longer generates the context
tokens. This is fairly easy to do in cases like the above where the context tokens
are written explicitly following the old nonterminal. It is harder to do when nonterminals
follow the old nonterminal, and harder still when the old nonterminal appears at
the end of a production.

It  can  be  argued  that  the  original  form  of  the  grammar  was  easier  to  write  and 
understand,  because  it  didn’t  have  the  ‘same’  semicolon  written  in  a  number  of  places. 
If  one  of  the  semicolons  had  been  left  out  of  any  of  the  Fixed_list_semi  alternatives, 
this might not have been obvious.

Also, it is possible that both the old and the new nonterminal are needed in the final
grammar. The semantic actions for them must be the same, and the grammar writer
must ensure that they are. This adds another place where the grammar writer’s job is
made more difficult because of the limitation k = 1.

The  second  grammar  to  be  discussed  comes  from  the  Wiz  grammar. 

1: S -> ‘(’ Val ‘-’ ‘>’ ‘y’ ‘)’

2: Val -> Val2
3:      | Val ‘&’ Val2

4: Val2 -> ‘x’
5:       | Val2 ‘-’ ‘x’

This leads to a state like:

Val -> Val ‘&’ Val2 •
Val2 -> Val2 • ‘-’ ‘x’

If a ‘-’ comes next, it could be the one in production 1 or the one in production 5. With
k = 2 the cases can be distinguished by ‘>’ or ‘x’. The grammar would probably be
made LALR(1) by using what will be called the ‘token fixup’. This involves making the
scanner look ahead after a ‘-’ and returning a new token, ‘->’, if the next character is a
‘>’. This is the technique usually used. In fact, most compiler writers automatically
assume that the scanner will recognize all such multicharacter tokens (which are
multicharacter because of character set limitations), whether or not this is required to
make the parser LALR(1).
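
A scanner that applies the token fixup to the minus sign and the greater-than sign might look like the following Python sketch. The function name `scan` and the treatment of every other character as a one-character token are simplifying assumptions.

```python
def scan(chars):
    """Split `chars` into tokens, merging '-' followed by '>' into one '->' token."""
    tokens = []
    i = 0
    while i < len(chars):
        # One character of scanner lookahead replaces the second parser lookahead token.
        if chars[i] == "-" and i + 1 < len(chars) and chars[i + 1] == ">":
            tokens.append("->")
            i += 2
        else:
            tokens.append(chars[i])
            i += 1
    return tokens

tokens = scan("(x&x-x->y)")
```

The parser then never sees a bare minus sign where the arrow is meant, so one token of lookahead is enough in the state shown above.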

However, there is something to be said for giving the grammar writer the added
flexibility of writing the characters separately. First, there are probably instances in
the parser where it is known that the first character of a multicharacter token may
appear, but not the multicharacter token. In those cases the scanner is wasting time
by checking for the multicharacter token. For example, a ‘:’ can usually appear in
many places in a programming language grammar, and the scanner would have to
check for a ‘=’ at each place if the token ‘:=’ is used for assignment.

Also, the implementation of the scanner could easily be such that every multicharacter
token that has to be recognized adds to the scanning time. A common scanner
implementation has a series of comparisons against a list of possible
start-of-multicharacter-token characters. Then the longer the list, the greater the average
time to scan any token (except identifiers and numbers, probably). It is often the case
that the scanner is a bottleneck in compiling, so anything that adds to scanning time
is to be avoided.
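
The effect of the list length can be seen in a deliberately naive Python model of such a scanner. The function names and the sample input are assumptions; real scanners are organized in many other ways, so this only illustrates the linear-comparison scheme described above.

```python
def comparisons(ch, starters):
    """Comparisons made by a linear search of `starters` for the character `ch`."""
    for count, starter in enumerate(starters, start=1):
        if ch == starter:
            return count
    return len(starters)   # every starter was tried and none matched

def average_comparisons(text, starters):
    """Average number of comparisons per character of `text`."""
    return sum(comparisons(ch, starters) for ch in text) / len(text)

short_list = average_comparisons("()+", ":")
long_list = average_comparisons("()+", ":-<>")
```

Characters that start no multicharacter token pay for the whole list, so each added starter raises the cost of scanning them.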

A  more  serious  problem  with  the  token  fixup  technique  is  that  there  may  be 
instances  in  the  language  where  the  characters  of  a  multicharacter  token  can  appear 
together,  and  yet  where  it  is  inconvenient  (and/or  difficult  to  understand)  to  regard 
them  as  a  single  token.  For  example: 

1: S -> ‘(’ Val ‘-’ ‘>’ ‘y’ ‘)’

2: Val -> Val2
3:      | Val2 Val2 ‘+’

4: Val2 -> Val2 ‘-’
5:       | Val2 ‘>’
6:       | ‘x’

(Here, Val could be a postfix expression, with ‘+’ being binary addition, ‘-’ being unary
negation, and ‘>’ being unary indirection.) A Val can produce ‘x’ ‘-’ ‘>’ but it would be
confusing to regard ‘->’ as a single token in that context because it doesn’t mean
“arrow”; it means “negation then indirection”. Using the token fixup requires a
grammar which has Val -> Val ‘->’ as a special case, and some other contortions to
make it unambiguous. Note that the new nonterminal fixup is also rather confusing to
use here because it requires a nonterminal which produces a “value followed by an
arrow”. The difficulty posed by grammars such as the above probably doesn’t occur
that often because language designers avoid lexical representations that might cause
such problems. Why should they be under that constraint if there is a method which
does not require it? Also, the language may have been imposed by some external person,
and the implementer may have no choice but to parse such a grammar.

A final, perhaps minor, reason why it might be desirable to allow multicharacter
tokens to be written out as separate characters has to do with error recovery. A
table-based parser generally recovers from a parsing error by inserting or replacing or
deleting tokens on the input stream. Which action is done is often based on the terminal
transitions in and around the state where the error was detected. Using this
method, it is easier to discover the relation between the input stream and a multicharacter
token if the successive characters are read by the parser. For instance, it is
easier to tell that ‘=’ is closer to ‘:’ ‘=’ than to ‘(’. Thus if the parser sees an ‘=’ after
an identifier at the beginning of a statement, it is better to insert a ‘:’ rather than
replace ‘=’ by ‘(’ (i.e., at the head of a subscript list). As an extreme example of the
possibilities, one could write out keywords letter by letter and allow the error recovery
mechanism to be used for spelling corrections. This would only be feasible for small
languages, or if memory becomes very cheap.

Here  is  another  example  from  the  Wiz  grammar. 

1: Bmodule -> ‘module’ ‘m’

2: S -> Bmodule ‘@’ Parm_list ‘:’ Val

3: Val -> Val1
4:      | Val ‘+’ Val1

5: Val1 -> ‘x’
6:       | Val1 ‘@’ ‘x’

7: Parm_list -> ‘p’
8:            | Parm_list ‘p’

The LR(0) machine has a state:

Val -> Val ‘+’ Val1 •
Val1 -> Val1 • ‘@’ ‘x’

The ‘@’ can follow the Val because it can follow a Bmodule. Here, both the fixup techniques
given above are unpleasant to use. The new nonterminal fixup requires having a
nonterminal which produces a value followed by an ‘@’. This in turn requires another
new nonterminal which produces a Bmodule followed by an ‘@’. By the time one has an
LALR(1) grammar, it is hard to read and understand. Even worse is if there are a
number of things that can appear where the ‘@’ is. Either a large number of productions
are required or the scanner must return a single token and an associated “scanvalue”
for each token in the class. The latter action is bad because the scanvalue is
passed to the “Value ‘@’” semantic routine, which has no use for it: a temporary location
is needed, setting up a complicated interrelationship among the semantic routines.

The token fixup is also unpleasant to use for this grammar. It requires making ‘@’
‘p’ into a single token, even though there is no relationship between the two tokens. In
fact, it may be difficult for the scanner to recognize ‘@’ ‘p’ as a single token: if ‘p’ is an
identifier which is found to be a parameter via a symbol table lookup, then the usual
method for recognizing multicharacter tokens does not work; it must be made more
complicated.

There were about ten other instances in the two grammars investigated where k = 2
would allow the grammar to be parsed as is, but k = 1 led to nondeterminism. They
were similar to the examples shown above. The grammar writers knew that they
would eventually want an LALR(1) grammar, and were well-versed in LALR theory. They
may have consciously or unconsciously used techniques which would help the result be
LALR(1). One wonders how many other places k>1 would be needed had this not been
the case. Anyway, even the examples found show that they would have finished faster,
with cleaner grammars, had they been allowed the use of k>1.

These days grammars are being written by people other than compiler specialists
(e.g., for machine description languages by hardware experts or for input languages by
graphics experts). They may not be very aware of LALR parsing theory, so it may be
harder for them to modify their grammars by hand to get an LALR(1) grammar. Any
aids, such as allowing the use of k>1, would be welcome.

What about automating the fixup methods described above? It has been seen that
the use of the fixups can lead to very messy grammars, so that for their automation to
be useful, the user should be able to write semantic actions based on the original grammar
and never look at the transformed one. This seems to be a very difficult thing to
do (especially if one has to debug the parser that comes out). It might be an area for
future research to accomplish this, but allowing k>1 is a reasonable method for
accomplishing the same thing. And, it is “simpler”. Which method yields smaller
and/or faster parsers is at this point unclear. It is easy to imagine that neither
method is superior all the time.

6.2  Other  TCG  Advantages 

One of the other advantages that has been given for TCG parsers is that they can handle
full LR languages (instead of just LALR). In the example grammars examined, no
cases were found where LR would have worked where LALR wouldn’t. It is probably true
that this is generally the case for programming language grammars, in which case
allowing full LR only has the advantage that there might be some isolated cases in
which it might be useful. It can be regarded as a bonus feature that costs very little
(because “state splitting” is confined to those places where it is needed).

What might be of more interest is that the full error detection power of LR parsers is
obtained with the RTCG method. It has been seen that the use of the RTCG method
allows the detection of errors as early as possible in the parse, whereas the LALR
method may let several reductions be done after that point. If one is willing to trade
off parser space for error detection then the RTCG method may be the one of choice.

6.3  Conclusions 

This thesis has investigated a class of grammars called terminal context grammars.
The theory of TCGs was developed, and it was shown that they do not have any greater
generative power than CFGs. Also, a subset of TCGs was identified that can be parsed
easily (about as easily as LR(1) grammars can be parsed). Those so-called TDRP grammars
are a special case of the DRP grammars identified by Turnbull [Turnbull], so that
a lot of his work is applicable to them.

The rest of the thesis dealt with an application of TCGs: using them to expand the
class of easily parsed grammars beyond the LALR(1) class. It was argued,
with examples, that the restriction to LALR(1) grammars sometimes hampers grammar
writers. It is possible to generate TCGs which can be parsed easily from CFGs for which
the LALR(1) method does not work.

Several different methods of generating TCGs from CFGs were discussed. First there
was the RTCG method, which adds right context to productions for lookahead and to
enable the closures to be done correctly. It works for any LR(k) grammar and produces
a parser which is usually a lot smaller than the LR(k) parser because it acts as if
different k’s were being used in different parts of the parser. The error detection properties
of the RTCG parsers are the same as those of the corresponding LR(k) parser:
errors are detected as early as possible in a parse. Then there was the SRTCG method,
which adds right context to productions for lookahead only. It too works for any LR(k)
grammar and produces a parser which is at least as small as the RTCG parser (usually
smaller). The difference is that the SRTCG parser may do some reductions after an
error could have been detected, though it won’t read any more input. (The same thing
occurs with LALR parsers.) Finally, there was the TLALR method, which is basically the
LALR(k) method with the existing machine transitions being used to restrict the
number of symbols that must be examined to make an action to one. For k = 1 it is
equivalent to LALR(1), but for k>1 there are some grammars which are LALR(k) but not
TLALR(k).

Each method has its advantages and disadvantages:


•  RTCG parsers have the best error detection. They can be used instead of one
of the other methods when the earlier error detection is needed for better
error recovery. Their disadvantage is that the parsing tables take up more
room than either SRTCG tables or LALR tables (if the latter exist). The space
premium is not so large that the method can’t be used: for practical grammars,
the reduced RTCG tables are limited to about twice the size of LALR
tables or one and a half times the size of the SRTCG tables.

•  SRTCG  parsers  handle  a  larger  class  of  grammars  than  LALR  parsers  for  not 
too  much  of  a  space  penalty  (about  15%)  for  those  grammars  which  happen  to 
be  LALR.  One  could  use  SRTCG  parsers  all  the  time. 

•  TLALR parsers produce the smallest tables, so perhaps a grammar should be
checked to see if TLALR works before using one of the other methods. This is
especially true if parsing speed is very important, because the tables needn’t
be compacted as much, meaning that fewer indirections would have to be done
during the parse.

Presentations of parsing in textbooks and the literature often discuss only the case of
k = 1, the implication being that k>1 is either impractical or unimportant. The
methods presented in this thesis and the examples earlier in this chapter have shown
that this needn’t be the case. It is hard to decide how much of a stifling effect the k = 1
restriction has had on people who wanted to use grammars. One thing that is true is
that language designers sometimes include in their goals the wish that the language be
easily parsable with an LALR(1) grammar (e.g., Euclid [Lampson et al. 77, Sec. 14.2]). It
is quite possible that this goal has resulted in language features (or lack thereof) which
the designers would have preferred to avoid.

Whole avenues of investigation may have been dismissed because of the thought:
“Oh, but that requires lookahead of greater than one...”. For example, someone might
have thought about using the general parser error recovery method to correct spelling
errors in keywords, and then dismissed it immediately because of the k>1 requirement.
It is not being suggested here that this particular example is necessarily a useful
method; only that perhaps it deserves more thought than immediate dismissal.

At any rate, allowing the use of k>1 should make the writing of grammars easier,
especially for “naive grammar writers”. There is no reason to use a less powerful technique
when it is completely subsumed by a more powerful one, so perhaps future
parser generators will include the option of k>1 for the convenience of their users.

6.4  Suggestions  for  Future  Work 

When practical grammars were being investigated to see where k>1 was necessary,
several instances were found of places where a convenient way of writing a grammar
resulted in a grammar which was not LR(k) for any k, even though it wasn’t ambiguous.
In all of those cases, a deterministic parsing action could be made if a terminal or two
were looked at after looking at some instance of a syntactic category such as “list of
identifiers” or “expression”. For example:

statement -> proc_id ‘(’ expr_list ‘)’
statement -> array_id ‘(’ expr_list ‘)’ ‘=’ expr ‘;’
expr_list -> expr
           | expr_list ‘,’ expr
expr -> ‘id’
proc_id -> ‘id’
array_id -> ‘id’

In order to decide which kind of ‘id’ is seen, one has to look after the expr_list, which
can be of any length.
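
The unbounded lookahead can be made concrete with a Python sketch that classifies the leading ‘id’ by looking past the whole expression list. The function name is hypothetical, and the token spellings (in particular the comma as list separator) are assumptions made for the illustration.

```python
def classify_leading_id(tokens):
    """Decide whether the first 'id' of a statement is a proc_id or an array_id."""
    # In this grammar an expr is a single 'id', so the first ')' closes the list.
    close = tokens.index(")")
    following = tokens[close + 1] if close + 1 < len(tokens) else None
    # Only the token after the closing parenthesis tells the two statements apart.
    return "array_id" if following == "=" else "proc_id"

call = ["id", "(", "id", ")"]
assignment = ["id", "(", "id", ",", "id", ",", "id", ")", "=", "id", ";"]
kinds = (classify_leading_id(call), classify_leading_id(assignment))
```

The distance from the leading ‘id’ to the deciding token grows with the length of the list, which is why no fixed k is sufficient.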

A  useful  extension  to  the  work  presented  in  this  thesis  would  be  to  allow  the  use  of 
nonterminal  context  (still  with  the  same  context  string  on  the  left  and  right  sides  of  a 
producion).  That  would  still  be  a  special  case  of  Turnbull’s  work,  though  a  larger  spe¬ 
cial  case  than  terminal  context  grammars.  It  is  likely  that  such  an  extension  would 
enable  the  efficient  parsing  of  almost  all  unambiguous  grammars  that  are  written  for 
programming  languages.  It  would  require,  however,  that  the  context  strings  must  be 
kept  in  table  accessible  at  parsing  time,  and  that  they  be  explicitly  pushed  back  onto 
the  input  queue  after  a  reduction.  This  was  not  necessary  for  TCG  parsers,  so  the 
extension  would  necessarily  lead  to  less  efficient  parsers. 

Another  topic  for  future  investigation  is  that  of  optimizing  TCG  parsers.  For  exam¬ 
ple,  it  might  be  possible  to  avoid  rereading  contexts  after  a  reduction  by  encoding  into 
the  machine  the  state  that  would  be  reached  after  doing  a  reduction  and  reading  the 
terminal  context  (if  there  were  only  one  such  state). 

It was mentioned earlier that an alternative to the methods of this thesis would be
automatic methods to transform CFGs into LALR(1) grammars. It is quite possible that
the TCG generation methods can give some inspiration in this area, because the
resulting parsers strongly resemble LALR(1) parsers, so that each argument made about
TCGs can probably be couched in terms of CFGs. It is felt, though, that the TCG
formalism is “elegant”, so that more insights may be gained by continuing to work with it.

APPENDIX A: SOME PROOFS


This  appendix  contains  the  proofs  of  theorems  and  lemmas  which  were  deemed  to  be 
too  long  and  not  important  enough  to  be  contained  in  the  main  body  of  the  thesis. 

Here is the proof of the theorem in chapter 4 about the $S_i$-CALC procedure. Recall
that the theorem was:

Theorem. When $S_i$-CALC terminates, the following is true:

1) $S_i(w) = \mathrm{FIRST}_m(L(\alpha_i x_i w))$, $1 \le i \le n$

2) The $H$ function, using the final values of $S_i(w)$, correctly calculates $H_m(\beta)$ for ...

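Proof. First it will be shown that $S_i(w) \subseteq \mathrm{FIRST}_m(L(\alpha_i x_i w))$ and
$H(\alpha) \subseteq \mathrm{FIRST}_m(L(\alpha))$. Let $S_i^k(w)$ represent the value of $S_i(w)$
after the $k$th time through the main loop. Assume that $S_i^k(w) \subseteq L(\alpha_i x_i w)$.
It will be shown that the equivalent statement for $k+1$ holds.

It is obviously true that

$$H(\alpha) \subseteq \mathrm{FIRST}_m(L(\alpha)), \quad \alpha \in N \cdot T^{*m} \cup T^{*m}$$

when $\alpha \in T^{*m}$. So suppose $\alpha = A a_1 \ldots a_t$. Then the elements added to the
$H(\alpha)$ set are like:

$$\mathrm{FIRST}_m(y a_{j+1} \ldots a_t), \quad \text{where } A_i x_i w = A a_1 \ldots a_j,\ y \in S_i(w)$$

By the induction hypothesis, there is a $z \in T^*$ such that

$$A a_1 \ldots a_j a_{j+1} \ldots a_t \Rightarrow \alpha_i x_i w a_{j+1} \ldots a_t \Rightarrow^* y z a_{j+1} \ldots a_t$$

If $|y| = m$ then the added string $\mathrm{FIRST}_m(y a_{j+1} \ldots a_t)$ is just equal to $y$,
which is in $\mathrm{FIRST}_m(L(A a_1 \ldots a_t))$, as required. If $|y| < m$ then $z = \lambda$, so
that $A a_1 \ldots a_t \Rightarrow^* y a_{j+1} \ldots a_t$, and again the added string is in
$\mathrm{FIRST}_m(L(A a_1 \ldots a_t))$.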

Thus, it has been shown that

$$H(\alpha) \subseteq \mathrm{FIRST}_m(L(\alpha)), \quad \alpha \in N \cdot T^{*m} \cup T^{*m}$$

Returning to the objective of showing that $S_i^{k+1}(w) \subseteq L(\alpha_i x_i w)$, notice that
the formula for $S_i^{k+1}(w)$ is

$$S_i^{k+1}(w) = S_i^k(w) \cup H(X_1 \cdot H(X_2 \ldots H(X_r) \ldots))$$

By the induction hypothesis the $S_i^k(w)$ component is correct. And, by the property of
$H$ just shown:

$$H(X_1 \cdot H(X_2 \ldots H(X_r) \ldots)) \subseteq \mathrm{FIRST}_m(L(X_1 \cdot \mathrm{FIRST}_m(L(X_2 \ldots \cdot \mathrm{FIRST}_m(L(X_r)) \ldots))))$$

Now for languages generated by TCGs, it is true that

$$L(X_1 \ldots X_n) = L(X_1 L(X_2 \ldots L(X_n) \ldots)), \quad X_i \in N \cup T$$

This is because the expression on the right mimics a rightmost canonical derivation,
which for a TCG replaces the rightmost nonterminal at each step.

So, replacing some terms by subsets of themselves, we get the following:

$$\begin{aligned}
H(X_1 \cdot H(X_2 \ldots H(X_r) \ldots))
  &\subseteq \mathrm{FIRST}_m(L(X_1 \cdot \mathrm{FIRST}_m(L(X_2 \ldots \cdot \mathrm{FIRST}_m(L(X_r)) \ldots)))) \\
  &\subseteq \mathrm{FIRST}_m(L(X_1 \cdot L(X_2 \ldots \cdot L(X_r) \ldots))) \\
  &\subseteq \mathrm{FIRST}_m(L(X_1 \ldots X_r))
\end{aligned}$$

This proves that $S_i^{k+1}(w) \subseteq L(\alpha_i x_i w)$.

The base case of the induction, $k = 0$, is trivial, so that it follows by induction that
$S_i(w) \subseteq L(\alpha_i x_i w)$ at the termination of the procedure.

Now the second half of the theorem will be proved: that $S_i$-CALC calculates all the
strings in $H_m(\alpha_i x_i w)$. Again, a proof by induction will be used.

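Suppose that $S_i$-CALC finds all strings in $H_m(\beta)$ which result from derivations of
length $k$ or less (where $\beta \in N \cdot T^{*m}$). $S_i$-CALC clearly works for $k = 1$ because
it only needs to calculate $H(a_1 \cdot H(a_2 \ldots \cdot H(a_s) \ldots))$, $a_i \in T$, which is just
$\mathrm{FIRST}_m(a_1 \ldots a_s)$.

Consider a derivation of length $k+1$:

$$A_i x_i w \Rightarrow \alpha_i x_i w \Rightarrow^{k} z$$

Suppose that

$$\alpha_i x_i w = y_0 A_{i_1} y_1 A_{i_2} \ldots y_{t-1} A_{i_t} y_t, \quad y_j \in T^*,\ A_{i_j} \in N$$

The derivation of $z$ can be split into $t$ major steps, each involving a derivation of length
$\le k$:

$$\begin{aligned}
A_{i_t} y_t &\Rightarrow^* z_t \\
A_{i_{t-1}} y_{t-1} z_t &\Rightarrow^* z_{t-1} \\
&\ \ \vdots \\
A_{i_1} y_1 z_2 &\Rightarrow^* z_1 = z
\end{aligned}$$

By the induction hypothesis, $S_i$-CALC will correctly include $\mathrm{FIRST}_m(z_j)$ in the
$\mathrm{FIRST}_m(L(A_{i_j} y_j z_{j+1}))$ set, $1 \le j \le t$ (letting $z_{t+1} = \lambda$). When calculating
$H_m(\alpha_i x_i w)$, $S_i$-CALC will add the equivalent of the following to $S_i(w)$:

$$\begin{aligned}
&H(\mathrm{FIRST}_m(y_0 \cdot H(\mathrm{FIRST}_m(A_{i_1} \cdot H(\ldots H(\mathrm{FIRST}_m(y_t)) \ldots))))) \\
&\quad = H(\mathrm{FIRST}_m(\ldots H(\mathrm{FIRST}_m(A_{i_t} y_t)) \ldots)) \\
&\quad \ni \mathrm{FIRST}_m(z)
\end{aligned}$$

The above uses the fact that each production needs a right context of at most $m$, so
that chopping off the intermediate strings with $\mathrm{FIRST}_m$ doesn’t affect the applicability
of productions.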
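
The effect of chopping intermediate strings with FIRSTm can be illustrated on plain sets of terminal strings. The following is a minimal sketch, not the Si-CALC procedure itself: it assumes ordinary concatenation of finite string sets (no productions or right contexts are modelled), and the helper names `first_m` and `concat` are hypothetical, introduced only for this illustration. It checks that truncating the right-hand set to m symbols before concatenating leaves the final m-symbol prefixes unchanged.

```python
from itertools import product


def first_m(strings, m):
    """m-symbol prefixes of a set of terminal strings (shorter strings are kept whole)."""
    return {s[:m] for s in strings}


def concat(left, right):
    """All concatenations of a string from `left` with a string from `right`."""
    return {x + y for x, y in product(left, right)}


m = 2
X = {"", "a", "ab"}      # illustrative strings standing in for what a prefix generates
Y = {"c", "cde", "f"}    # illustrative strings standing in for what the rest generates

full = first_m(concat(X, Y), m)
chopped = first_m(concat(X, first_m(Y, m)), m)

# Truncating Y to m symbols first does not change the resulting m-symbol prefixes.
assert full == chopped
```

In a TCG the right context also decides which productions apply, which is why the bound of m symbols on contexts matters in the proof; this sketch only covers the prefix arithmetic.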

Therefore, $S_i$-CALC includes strings in the $H_m$ sets resulting from derivations of
length $k+1$, and by induction, all the strings that should be there at termination. This
concludes the proof of the theorem. □

Chapter 4 made reference to a lemma which was developed to give the relationship
between parsers for full TCGs and LR(k) parsers. It uses some terms given in the
following definition.

Definition. If $G$ is a CFG and $G'$ is any $k$-TCG formed by replacing some
productions in $G$ by ARC of them, then call $G$ the core of $G'$. In the
machine $M(G')$, those transitions generated from items like $(Ax \to \alpha_1 \bullet B \alpha_2 x)$
(i.e., where $B$ is not in the terminal context part of the production) will be
called core symbol reads. (Note that calling a transition a core symbol
read does not preclude the possibility that the same transition is also
followed for a non-core symbol item.) Those items where the mark is
somewhere before the $x$ of a production $Ax \to \alpha x$ will be called simply core
items. □

Lemma. Suppose $G'$ is the full $k$-TCG formed from the CFG $G$ by taking $\mathrm{ARC}_k$ of each
production. Then $M(G')$ and $M_{LR(k)}(G)$ are related as follows:

a) Let $M'(G')$ be formed by removing all non-core transitions from $M(G')$, and the
states which are thus inaccessible from the start state. In the resulting
machine, there may be pairs of states whose core items are the same:
merge such states together (it is guaranteed that their successor states will also
be so equivalent) and form a new machine $M''(G')$. Let $M'_{LR(k)}(G)$ be formed by
removing all lookahead transitions in $M_{LR(k)}(G)$, and the states which are thus
inaccessible from the start state.

Then $M''(G')$ and $M'_{LR(k)}(G)$ are equivalent in the finite automaton sense.

b) All the additional states and reads in $M(G')$ can be derived from the lookahead
transitions in $M_{LR(k)}(G)$ by the following procedure. Suppose there is a
lookahead transition in $M_{LR(k)}(G)$ from state $q$ on a string $a_1 a_2 \ldots a_k$ ($a_i \in T$).
Then in $M(G')$ there will be read transitions of $a_1, \ldots, a_k$. If there is a core symbol
read from $q$ of $a_1$ then follow it to $q'$, and repeat for $q'$, etc., until a state is reached
where there is no core symbol read for the next symbol of the string. Do nothing if
there is no such state. Then $M(G')$ will have some new states and transitions whose
sole purpose is to read the remaining symbols. There will be an apply in the state
reached by reading $a_k$.

This says that $M(G')$ and $M_{LR(k)}(G)$ are essentially the same. They act identically (go
through equivalent states) on the core symbols of the grammar; the only difference
between the two is that after the core symbols have been read, $M_{LR(k)}(G)$ looks ahead
at $k$ symbols and then applies the core production, whereas $M(G')$ reads the $k$ symbols
one at a time through a number of states and then applies the ARC’d production. After
the applies, the machines again end up in equivalent states.

The effect of the relationship described is that both $M(G')$ and $M_{LR(k)}(G)$ parse all
strings in the same manner, and they have approximately the same number of states.

Proof. Essentially, the reason why this lemma is true is that the $M(G')$ core items and
‘corresponding’ $M_{LR(k)}(G)$ items contain equivalent information. It is just that
$M_{LR(k)}(G)$ items carry $k$-symbol context strings separately and $M(G')$ items carry the
same strings as part of the ARC’d productions. For convenience, let this
correspondence be characterized by a function $f$ mapping $M_{LR(k)}(G)$ items to $M(G')$ items:

$$f((A \to \alpha \bullet \beta,\ x)) = (Ax \to \alpha \bullet \beta x)$$

The item $(Ax \to \alpha \bullet \beta x)$ will be a valid item for $G'$ if $(A \to \alpha \bullet \beta,\ x)$ is an item in
$M_{LR(k)}(G)$, because the latter fact implies that $x \in \mathrm{FOLLOW}_k(A)$ and thus, that
$Ax \to \alpha\beta x \in \mathrm{ARC}_k(A \to \alpha\beta)$. It is also true that $f^{-1}$, mapping core items of $M(G')$
into items of $M_{LR(k)}(G)$, is a function.
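
The correspondence f and its inverse on core items can be sketched in a few lines. This is only an illustration under stated assumptions: the classes `LRItem` and `TCGItem`, their field names, and the choice to treat an item whose mark sits immediately before the context as a core item are hypothetical conventions adopted here, not definitions taken from PGEN.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LRItem:
    """LR(k) item (A -> alpha . beta, x); `dot` is an index into `rhs`."""
    lhs: str
    rhs: Tuple[str, ...]
    dot: int
    lookahead: Tuple[str, ...]


@dataclass(frozen=True)
class TCGItem:
    """Item (A x -> alpha . beta x); `rhs` is the core right side followed by x."""
    lhs: str
    context: Tuple[str, ...]
    rhs: Tuple[str, ...]
    dot: int

    def is_core(self) -> bool:
        # Assumption: the mark may sit anywhere up to, and including,
        # the position immediately before the context x.
        return self.dot <= len(self.rhs) - len(self.context)


def f(item: LRItem) -> TCGItem:
    """Fold the separate lookahead string into the production as terminal context."""
    return TCGItem(item.lhs, item.lookahead, item.rhs + item.lookahead, item.dot)


def f_inv(item: TCGItem) -> LRItem:
    """Split the context back out; only meaningful for core items."""
    if not item.is_core():
        raise ValueError("f_inv is only defined on core items")
    core_len = len(item.rhs) - len(item.context)
    return LRItem(item.lhs, item.rhs[:core_len], item.dot, item.context)


lr = LRItem("A", ("a", "B"), 1, ("c",))   # (A -> a . B, c)
tcg = f(lr)                               # (A c -> a . B c)
assert f_inv(tcg) == lr
```

An item whose mark lies inside the context, such as the one reached after reading `c` above, has no LR(k) counterpart, which is why `f_inv` rejects it.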

First, it will be shown that the LR(k) closure function and the closure function of
PGEN work analogously. The Knuth closure for the LR(k) parser generator is

$$C_{LR(k)}(A \to \alpha_1 \bullet B \alpha_2,\ y) = \{(B \to \bullet \beta,\ z) \mid z \in H_k(\alpha_2 y)\}$$

Here, $H_k(\alpha)$ is Knuth’s function, which is the same as $H_m(\alpha)$ except that only $k$-letter
strings are in the $H_k(\alpha)$ set: $H_m(\alpha)$ allows strings of length $< k$ if they are the whole
string generated from $\alpha$. The closure function for $M(G')$ is

$$C'(Ax \to \alpha_1 \bullet B \alpha_2 x) = \{(Bu \to \bullet \beta u) \mid u \in H_m(\alpha_2 x),\ m = k\}$$

Now, because this is a full $k$-TCG, $x$ will always be of length $k$. Hence, the closure
function always has enough context, and the abort problem can’t occur.
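
The difference between the two prefix functions, and why it disappears for a full k-TCG, can be seen on a finite set of strings. The sketch below is an assumption-laden illustration: `h_m` and `h_k` are hypothetical helpers that act on an explicit finite set of terminal strings standing in for the strings generated from a sentential form, which is not how a parser generator would compute them.

```python
def h_m(language, m):
    """Prefixes of length m; a shorter string is kept when it is the whole string."""
    return {w[:m] for w in language}


def h_k(language, k):
    """Knuth-style prefixes: only strings supplying a full k symbols contribute."""
    return {w[:k] for w in language if len(w) >= k}


language = {"a", "ab", "abc", "bcd"}   # illustrative strings only
k = 2

short_allowed = h_m(language, k)       # keeps the short whole string "a"
exact_only = h_k(language, k)          # drops it

# With a context x of length k appended, every string has at least k symbols,
# so the two functions agree.
x = "$$"
with_context = {w + x for w in language}
assert h_m(with_context, k) == h_k(with_context, k)
```

This mirrors the remark that a context of length k always gives the closure function enough symbols to work with.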

It can easily be seen that given any $M_{LR(k)}(G)$ set $S$,

$$f(C_{LR(k)}(S)) = C'(f(S))$$

And, because non-core items only have terminal symbols marked and thus do not
contribute anything new to closures, it is true that given any $M(G')$ set $S'$,

$$f^{-1}(C'(S')) = C_{LR(k)}(f^{-1}(S'))$$

It can be concluded that there is a one-to-one correspondence between the items in
$C_{LR(k)}(S)$ and $C'(f(S))$.

The initial state set for $M_{LR(k)}$ is $(S_0 \to \bullet \alpha_0,\ \bot^k)$ (where $S_0 \to \alpha_0$ is the goal
production). The initial state set for $M(G')$ is $f((S_0 \to \bullet \alpha_0,\ \bot^k))$, so the initial states
are what could be called ‘$f$-equivalent’.

The initial state sets for the next states are also $f$-equivalent except for the following
circumstance. Suppose $M(G')$ has an item $(A x_1 a x_2 \to \alpha x_1 \bullet a x_2)$. Then the next
state for a read transition for $a$ will include $(A x_1 a x_2 \to \alpha x_1 a \bullet x_2)$. $M_{LR(k)}(G)$ will
include no such item because an apply transition will have been generated when the end
of the (core) production was reached. However, except for non-core items, the next
states will still be $f$-equivalent because $C'$ of non-core items is always $\emptyset$. The non-core
symbols will come into $M(G')$ in the manner described in (b) above.

When trying to see whether a state already exists, a slight complication arises: there
may be two states in $M(G')$ which differ only by non-core items. In $M_{LR(k)}(G)$, there
would only be one state. The merging described in (a) is to take care of this. Both
paths of $M(G')$ will continue identically to the single path of $M_{LR(k)}(G)$ (ignoring the
non-core items in the continuation states), because the latter do not affect the
closures.

Thus, by induction, the lemma has been shown. □

C.J.M. Turnbull, September 1972

CSRG-19  PROJECT  SUE  AS  A  LEARNING  EXPERIENCE 

K.C. Sevcik et al., September 1972
[Proceedings  AFIPS  Fall  Joint  Computer  Conference, 
v.  41,  December  1972] 

*  CSRG-20  A  STUDY  OF  LANGUAGE  DIRECTED  COMPUTER  DESIGN 

David  B.  Wortman,  December  1972 

[Ph.D.  Thesis,  Computer  Science  Department, 

Stanford  University,  1972] 

CSRG-21  AN  APL  TERMINAL  APPROACH  TO  COMPUTER  MAPPING 
R.  Kvaternik,  December  1972 
[M.Sc. Thesis, DCS, 1972]

*  CSRG-22  AN  IMPLEMENTATION  LANGUAGE  FOR  MINICOMPUTERS 

G.G.  Kalmar,  January  1973 
[M.Sc.  Thesis,  DCS,  1972] 

CSRG-23  COMPILER  STRUCTURE 

W.M.  McKeeman,  January  1973 

[Proceedings  of  the  USA-Japan  Computer  Conference,  1972] 


CSRG-24  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER  PROGRAM 
ENGINEERING 

J.D.  Gannon  (ed.),  March  1973 

CSRG-25  THE  INVESTIGATION  OF  SERVICE  TIME  DISTRIBUTIONS 
Eleanor  A.  Lester,  April  1973 
[M.Sc.  Thesis,  DCS,  1973] 

CSRG-26 PSYCHOLOGICAL COMPLEXITY OF COMPUTER PROGRAMS:

AN  INITIAL  EXPERIMENT 
Larry  Weissman,  August  1973 

CSRG-27  STRUCTURED  SUBSETS  OF  THE  PL/I  LANGUAGE 

Richard  C.  Holt  and  David  B.  Wortman,  October  1973 

CSRG-28  ON  REDUCED  MATRIX  REPRESENTATION  OF  LR(k) 

PARSER  TABLES 

Marc  Louis  Joliat,  October  1973 

[Ph.D.  Thesis,  EE  1973] 

CSRG-29  A  STUDENT  PROJECT  FOR  AN  OPERATING  SYSTEMS  COURSE 
B. Czarnik and D. Tsichritzis (eds.), November 1973

CSRG-30  A  PSEUDO-MACHINE  FOR  CODE  GENERATION 
Henry  John  Pasko,  December  1973 
[M.Sc.  Thesis,  DCS  1973] 

CSRG-31 AN ANNOTATED BIBLIOGRAPHY ON COMPUTER PROGRAM ENGINEERING
J.D.  Gannon  (ed.),  Second  Edition,  March  1974 

CSRG-32  SCHEDULING  MULTIPLE  RESOURCE  COMPUTER  SYSTEMS 

E.D. Lazowska, May 1974
[M.Sc.  Thesis,  DCS,  1974] 

CSRG-33  AN  EDUCATIONAL  DATA  BASE  MANAGEMENT  SYSTEM 

F.  Lochovsky  and  D.  Tsichritzis,  May  1974 
[INFOR.  14  (3),  pp. 270-278,  1976] 

CSRG-34  ALLOCATING  STORAGE  IN  HIERARCHICAL  DATA  BASES 
P.  Bernstein  and  D.  Tsichritzis,  May  1974 
[Information  Systems  Journal,  v.  1,  pp.  133-140] 

CSRG-35  ON  IMPLEMENTATION  OF  RELATIONS 
D.  Tsichritzis,  May  1974 

CSRG-36  SIX  PL/I  COMPILERS 

D.B.  Wortman,  P.J.  Khaiat,  and  D.M.  Lasker,  August  1974 
[Software  Practice  and  Experience,  v.6,  n.3, 

July-Sept.  1976] 

CSRG-37  A  METHODOLOGY  FOR  STUDYING  THE  PSYCHOLOGICAL  COMPLEXITY 
OF  COMPUTER  PROGRAMS 
Laurence  M.  Weissman,  August  1974 
[Ph.D.  Thesis,  DCS,  1974] 

* CSRG-38 AN INVESTIGATION OF A NEW METHOD OF CONSTRUCTING SOFTWARE

David  M.  Lasker,  September  1974 
[M.Sc.  Thesis,  DCS,  1974] 

CSRG-39  AN  ALGEBRAIC  MODEL  FOR  STRING  PATTERNS 
Glenn  F.  Stewart,  September  1974 
[M.Sc.  Thesis,  DCS,  1974] 

*  CSRG-40  EDUCATIONAL  DATA  BASE  SYSTEM  USER’S  MANUAL 

J.  Klebanoff,  F.  Lochovsky,  A.  Rozitis,  and 

D.  Tsichritzis,  September  1974 

*  CSRG-41  NOTES  FROM  A  WORKSHOP  ON  THE  ATTAINMENT  OF 

RELIABLE  SOFTWARE 

David  B.  Wortman  (ed.),  September  1974 

*  CSRG-42  THE  PROJECT  SUE  SYSTEM  LANGUAGE  REFERENCE  MANUAL 

B.L.  Clark  and  F.J.B.  Ham,  September  1974 

*  CSRG-43  A  DATA  BASE  PROCESSOR 

E.A. Ozkarahan, S.A. Schuster and K.C. Smith,

November  1974  [Proceedings  National  Computer 
Conference  1975,  v.44,  pp. 379-388] 

*  CSRG-44  MATCHING  PROGRAM  AND  DATA  REPRESENTATION  TO  A 

COMPUTING  ENVIRONMENT 
Eric C.R. Hehner, November 1974
[Ph.D.  Thesis,  DCS,  1974] 

See  Computer,  Vol.9,  No. 9,  August  1976,  pp. 65-70. 

*  CSRG-45  THREE  APPROACHES  TO  RELIABLE  SOFTWARE;  LANGUAGE  DESIGN, 

DYADIC  SPECIFICATIONS,  COMPLEMENTARY  SEMANTICS 
J.E.  Donahue,  J.D.  Gannon,  J.V.  Guttag  and 
J.J.  Horning,  December  1974 

CSRG-46  THE  SYNTHESIS  OF  OPTIMAL  DECISION  TREES  FROM 
DECISION  TABLES 

Helmut  Schumacher,  December  1974 

[M.Sc.  Thesis,  DCS,  1974;  CACM,  v.19,  n.6,  June  1976] 

*  CSRG-47  LANGUAGE  DESIGN  TO  ENHANCE  PROGRAMMING  RELIABILITY 

John  D.  Gannon,  January  1975 
[Ph.D. Thesis, DCS, 1975]

*  CSRG-48  DETERMINISTIC  LEFT  TO  RIGHT  PARSING 

Christopher  J.M.  Turnbull,  January  1975 
[Ph.D.  Thesis,  EE,  1974] 

*  CSRG-49  A  NETWORK  FRAMEWORK  FOR  RELATIONAL  IMPLEMENTATION 

D.  Tsichritzis,  February  1975  [in  Data  Base  Description, 

Douque and Nijssen (eds.), North Holland Publishing Co.]

* CSRG-50 A UNIFIED APPROACH TO FUNCTIONAL DEPENDENCIES

AND  RELATIONS 

P.A. Bernstein, J.R. Swenson and D.C. Tsichritzis,
February 1975 [Proceedings of the ACM SIGMOD
Conference,  1975] 

*  CSRG-51  ZETA:  A  PROTOTYPE  RELATIONAL  DATA  BASE  MANAGEMENT  SYSTEM 

M. Brodie (ed.), February 1975 [Proceedings Pacific ACM
Conference,  1975] 

*  CSRG-52  AUTOMATIC  GENERATION  OF  SYNTAX-REPAIRING  AND 

PARAGRAPHING  PARSERS 
David  T.  Barnard,  March  1975 
[M.Sc.  Thesis,  DCS,  1975] 

*  CSRG-53  QUERY  EXECUTION  AND  INDEX  SELECTION  FOR  RELATIONAL 

DATA  BASES 

J.H.  Gilles  Farley  and  Stewart  A.  Schuster,  March  1975 

CSRG-54  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER  PROGRAM 
ENGINEERING 

J.V.  Guttag  (ed.),  Third  Edition,  April  1975 

CSRG-55 STRUCTURED SUBSETS OF THE PL/I LANGUAGE

Richard  C.  Holt  and  David  B.  Wortman,  May  1975 

*  CSRG-56  FEATURES  OF  A  CONCEPTUAL  SCHEMA 

D.  Tsichritzis,  June  1975  [Proceedings  Very  Large 
Data  Base  Conference,  1975] 

*  CSRG-57  MERLIN:  TOWARDS  AN  IDEAL  PROGRAMMING  LANGUAGE 

Eric  C.R.  Hehner,  July  1975 

See Acta Informatica, Vol. 10, No. 3, pp. 229-243, 1978

CSRG-58  ON  THE  SEMANTICS  OF  THE  RELATIONAL  DATA  MODEL 
Hans  Albrecht  Schmid  and  J.  Richard  Swenson, 

July  1975  [Proceedings  of  the  ACM  SIGMOD  Conference,  1975] 

*  CSRG-59  THE  SPECIFICATION  AND  APPLICATION  TO  PROGRAMMING 

OF  ABSTRACT  DATA  TYPES 
John  V.  Guttag,  September  1975 
[Ph.D. Thesis, DCS, 1975]

*  CSRG-60  NORMALIZATION  AND  FUNCTIONAL  DEPENDENCIES  IN  THE 

RELATIONAL  DATA  BASE  MODEL 
Phillip  Alan  Bernstein,  October  1975 
[Ph.D.  Thesis,  DCS,  1975] 

*  CSRG-61  LSL:  A  LINK  AND  SELECTION  LANGUAGE 

D.  Tsichritzis,  November  1975  [Proceedings  ACM 
SIGMOD  Conference,  1976] 

* CSRG-62 COMPLEMENTARY DEFINITIONS OF PROGRAMMING LANGUAGE

SEMANTICS 

James  E.  Donahue,  November  1975 
[Ph.D.  Thesis,  DCS,  1975] 

CSRG-63  AN  EXPERIMENTAL  EVALUATION  OF  CHESS  PLAYING  HEURISTICS 
Lazio  Sugar,  December  1975 
[M.Sc.  Thesis,  DCS,  1975] 

CSRG-64  A  VIRTUAL  MEMORY  SYSTEM  FOR  A  RELATIONAL  ASSOCIATIVE 
PROCESSOR 

S.A.  Schuster,  E.A.  Ozkarahan,  and  K.C.  Smith, 

February  1976  [Proceedings  National  Computer 
Conference  1976,  v.45,  pp. 855-862] 

CSRG-65  PERFORMANCE  EVALUATION  OF  A  RELATIONAL  ASSOCIATIVE 
PROCESSOR 

E.A.  Ozkarahan,  S.A.  Schuster,  and  K.C.  Sevcik, 

February  1976  [ACM  Transactions  on  Database 
Systems, v.1, n.4, December 1976]

CSRG-66  EDITING  COMPUTER  ANIMATED  FILM 
Michael  D.  Tilson,  February  1976 
[M.Sc.  Thesis,  DCS,  1975] 

CSRG-67  A  DIAGRAMMATIC  APPROACH  TO  PROGRAMMING  LANGUAGE 
SEMANTICS 

James  R.  Cordy,  March  1976 
[M.Sc.  Thesis,  DCS,  1976] 

*  CSRG-68  A  SYNTHETIC  ENGLISH  QUERY  LANGUAGE  FOR  A  RELATIONAL 

ASSOCIATIVE  PROCESSOR 

L.  Kerschberg,  E.A.  Ozkarahan,  and  J.E.S.  Pacheco, 

April  1976 

CSRG-69  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER  PROGRAM 
ENGINEERING 

D.  Barnard  and  D.  Thompson  (eds.),  Fourth  Edition, 

May  1976 

*  CSRG-70  A  TAXONOMY  OF  DATA  MODELS 

L. Kerschberg, A. Klug, and D. Tsichritzis, May 1976
[Proceedings  Very  Large  Data  Base  Conference,  1976] 

*  CSRG-71  OPTIMIZATION  FEATURES  FOR  THE  ARCHITECTURE  OF  A 

DATA  BASE  MACHINE 

E.A. Ozkarahan and K.C. Sevcik, May 1976
[ACM Transactions on Database Systems, v.2, n.4, December 1977]

*  CSRG-72  THE  RELATIONAL  DATA  BASE  SYSTEM  OMEGA  -  PROGRESS  REPORT 

H.A.  Schmid  (ed.),  P.A.  Bernstein  (ed.),  B.  Arlow, 

R.  Baker  and  S.  Pozgaj,  July  1976 

CSRG-73 AN ALGORITHMIC APPROACH TO NORMALIZATION OF

RELATIONAL  DATA  BASE  SCHEMAS 

P.A.  Bernstein  and  C.  Beeri,  September  1976 

*  CSRG-74  A  HIGH-LEVEL  MACHINE-ORIENTED  ASSEMBLER  LANGUAGE 
FOR  A  DATA  BASE  MACHINE 

E.A.  Ozkarahan  and  S.A.  Schuster,  October  1976 

CSRG-75  DO  CONSIDERED  OD:  A  CONTRIBUTION  TO  THE  PROGRAMMING 
CALCULUS 

Eric  C.R.  Hehner,  November  1976 
[Acta Informatica, to appear 1979]

CSRG-76  SOFTWARE  HUT:  A  COMPUTER  PROGRAM  ENGINEERING 
PROJECT  IN  THE  FORM  OF  A  GAME 
J.J. Horning and D.B. Wortman, November 1976
[IEEE  Transactions  on  Software  Engineering,  v.SE-3,  n.4,  July  1977] 

CSRG-77  A  SHORT  STUDY  OF  PROGRAM  AND  MEMORY  POLICY  BEHAVIOUR 
G.  Scott  Graham,  January  1977 

CSRG-78  A  PANACHE  OF  DBMS  IDEAS 

D.  Tsichritzis  (ed.),  February  1977 

CSRG-79  THE  DESIGN  AND  IMPLEMENTATION  OF  AN  ADVANCED  LALR 
PARSE  TABLE  CONSTRUCTOR 
David  H.  Thompson,  April  1977 
[M.Sc.  Thesis,  DCS,  1976] 

CSRG-80  AN  ANNOTATED  BIBLIOGRAPHY  ON  COMPUTER  PROGRAM 
ENGINEERING 

D.  Barnard  (ed.),  Fifth  Edition,  May  1977 

CSRG-81  PROGRAMMING  METHODOLOGY:  AN  ANNOTATED  BIBLIOGRAPHY 
FOR IFIP WORKING GROUP 2.3

Sol  J.  Greenspan  and  J.J.  Horning  (eds.),  First  Edition,  May  1977 

CSRG-82 NOTES ON EUCLID

edited  by  W.  David  Elliot  and  David  T.  Barnard,  August  1977 

CSRG-83  TOPICS  IN  QUEUEING  NETWORK  MODELING 
edited  by  G.  Scott  Graham,  July  1977 

CSRG-84  TOWARD  PROGRAM  ILLUSTRATION 

Edward  Yarwood,  September  1977 
[M.Sc. Thesis, DCS, 1974]

CSRG-85  CHARACTERIZING  SERVICE  TIME  AND  RESPONSE  TIME 

DISTRIBUTIONS  IN  QUEUEING  NETWORK  MODELS  OF  COMPUTER 
SYSTEMS 

Edward  D.  Lazowska,  September  1977 
[Ph.D.  Thesis,  DCS,  1977] 

CSRG-86 MEASUREMENTS OF COMPUTER SYSTEMS FOR QUEUEING
NETWORK  MODELS 
Martin  G.  Kienzle,  October  1977 

[M.Sc.  Thesis,  DCS,  1977;  Proc.  Int.  Symp.  on  Modelling  and  Performance 
Evaluation  of  Computer  Systems,  Vienna,  1979] 

CSRG-87 ‘OLGA’ LANGUAGE REFERENCE MANUAL

B.  Abourbih,  H.  Trickey,  D.M.  Lewis,  E.S.  Lee, 

P.I.P.  Boulton,  November  1977 

CSRG-88  USING  A  GRAMMATICAL  FORMALISM  AS  A  PROGRAMMING  LANGUAGE 
Brad  A.  Silverberg,  January  1978 
[M.Sc.  Thesis,  DCS,  1978] 

CSRG-89  ON  THE  IMPLEMENTATION  OF  RELATIONS:  A  KEY  TO  EFFICIENCY 
Joachim  W.  Schmidt,  January  1978 

CSRG-90  DATA  BASE  MANAGEMENT  SYSTEM  USER  PERFORMANCE 
Frederick  H.  Lochovsky,  April  1978 
[Ph.D.  Thesis,  DCS,  1978] 

CSRG-91  SPECIFICATION  AND  VERIFICATION  OF  DATA  BASE 
SEMANTIC  INTEGRITY 
Michael  Lawrence  Brodie,  April  1978 
[Ph.D.  Thesis,  DCS,  1978] 

CSRG-92  STRUCTURED  SOUND  SYNTHESIS  PROJECT  (SSSP): 

AN  INTRODUCTION 

by  William  Buxton,  Guy  Fedorkow,  with  Ronald  Baecker, 

Gustav  Ciamaga,  Leslie  Mezei  and  K.C.  Smith,  June  1978 

CSRG-93  A  DEVICE-INDEPENDENT, GENERAL-PURPOSE  GRAPHICS  SYSTEM 
IN  A  MINICOMPUTER  TIME-SHARING  ENVIRONMENT 
William  T.  Reeves,  August  1978 
[M.Sc.  Thesis,  DCS,  1976] 

CSRG-94  ON  THE  AXIOMATIC  VERIFICATION  OF 
CONCURRENT  ALGORITHMS 
Christian  Lengauer,  August  1978 
[M.Sc. Thesis, DCS, 1978]

CSRG-95  PISA:  A  PROGRAMMING  SYSTEM  FOR  INTERACTIVE 
PRODUCTION  OF  APPLICATION  SOFTWARE 
Rudolf  Marty,  August  1978 

CSRG-96  ADAPTIVE  MICROPROGRAMMING  AND  PROCESSOR  MODELING 
Walter  G.  Rosocha 
[Ph.D.  Thesis,  EE,  August  1978] 

CSRG-97  DESIGN  ISSUES  IN  THE  FOUNDATION  OF  A  COMPUTER-BASED 
TOOL  FOR  MUSIC  COMPOSITION 
William  Buxton 

[M.Sc.  Thesis,  CSRG,  October  1978] 

CSRG-98 THEORY OF DATABASE MAPPINGS
Anthony  C.  Klug 

[Ph.D.  Thesis,  DCS,  December  1978] 

CSRG-99  HIERARCHICAL  COROUTINES:  A  MECHANISM  FOR  IMPROVED 
PROGRAM  STRUCTURE 
Leonard  I.  Vanek,  February  1979 

CSRG-100  TOPICS  IN  PERFORMANCE  EVALUATION 
G.  Scott  Graham  (ed.),  July  1979 

CSRG-101  A  PANACHE  OF  DBMS  IDEAS  II 

F.H. Lochovsky (ed.), May 1979

CSRG-102  A  SIMPLE  SET  THEORY  FOR  COMPUTING  SCIENCE 
Eric  C.R.  Hehner,  May  1979 

CSRG-103  THE  CENTRALIZED  ALGORITHM  IN  DISTRIBUTED  SYSTEMS 
Ernest  J.H.  Chang 
[Ph.D.  Thesis,  DCS,  July  1979] 

CSRG-104  ELIMINATING  THE  VARIABLE  FROM  DIJKSTRA’S 
MINI-LANGUAGE 
D.  Hugh  Redelmeier,  July  1979 

CSRG-105  A  LANGUAGE  FACILITY  FOR  DESIGNING  INTERACTIVE 
DATABASE-INTENSIVE  APPLICATIONS 
John  Mylopoulos,  Philip  A.  Bernstein,  Harry  K.T.  Wong, 
July  1979 

CSRG-106  ON  APPROXIMATE  SOLUTION  TECHNIQUES  FOR 

QUEUEING  NETWORK  MODELS  OF  COMPUTER  SYSTEMS 
Satish  Kumar  Tripathi,  July  1979 

CSRG-107  A  FRAMEWORK  FOR  VISUAL  MOTION  UNDERSTANDING 
John K. Tsotsos, John Mylopoulos, H. Dominic Covvey,
Steven W. Zucker, DCS, June 1979

CSRG-108  DIALOGUE  ORGANIZATION  AND  STRUCTURE  FOR 
INTERACTIVE  INFORMATION  SYSTEMS 
John  Leonard  Barron 
[M.Sc.  Thesis,  DCS,  1980] 

CSRG-109  A  UNIFYING  MODEL  OF  PHYSICAL  DATABASES 
D.S.  Batory,  C.C.  Gotlieb,  April  1980 

CSRG-110  OPTIMAL  FILE  DESIGNS  AND  REORGANIZATION  POINTS 
D.S.  Batory,  April  1980 

CSRG-111 A PANACHE OF DBMS IDEAS III
D.  Tsichritzis  (ed.),  April  1980 

CSRG-112 TOPICS IN PSN - II: EXCEPTIONAL CONDITION
HANDLING IN PSN; REPRESENTING PROGRAMS IN PSN;
CONTENTS IN PSN
Yves Lesperance, Bryan M. Kramer, Peter F. Schneider
April 1980

CSRG-113 SYSTEM-ORIENTED MACRO-SCHEDULING
C.C. Gotlieb and A. Schonbach
May 1980

CSRG-114 A FRAMEWORK FOR VISUAL MOTION UNDERSTANDING
John Konstantine Tsotsos
[Ph.D. Thesis, DCS, June 1980]

CSRG-115 SPECIFICATION OF CONCURRENT EUCLID
James  R.  Cordy  and  Richard  C.  Holt 
July  1980 

CSRG-116 THE REPRESENTATION OF PROGRAMS IN THE
PROCEDURAL SEMANTIC NETWORK FORMALISM
Bryan M. Kramer
[M.Sc. Thesis, DCS, 1980]

CSRG-117 CONTEXT-FREE GRAMMARS AND DERIVATION TREES AS
PROGRAMMING TOOLS
Volker Linnemann
September 1980

CSRG-118 S/SL: SYNTAX/SEMANTIC LANGUAGE
INTRODUCTION AND SPECIFICATION
R.C. Holt, J.R. Cordy, D.B. Wortman
CSRG, September 1980

CSRG-119 PT: A PASCAL SUBSET
Alan Rosselet
[M.Sc. Thesis, DCS, October 1980]

CSRG-120 PTED: A STANDARD PASCAL TEXT EDITOR BASED ON
THE KERNIGHAN AND PLAUGER DESIGN
Ken  Newman,  DCS 
October  1980 

CSRG-121 TERMINAL CONTEXT GRAMMARS
Howard W. Trickey
[M.Sc.  Thesis,  EE,  September  1980] 


